This repository was archived by the owner on Jan 23, 2023. It is now read-only.

SpinLock.TryEnter fail fast for timeout 0 #6944

Closed. benaadams wants to merge 1 commit into dotnet:master from benaadams:spinlock-failfast


@benaadams
Member

Previously, a timeout of 0 would Interlocked.Add to register as a waiter and then CAS-spin to unregister immediately afterwards; now it exits before registering as a waiter, so it skips both.
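
To illustrate the fast path described above, here is a minimal sketch. `SketchSpinLock`, its state layout (bit 0 for "held", waiters counted in the upper bits) and the member names are assumptions made for illustration; this is not the real System.Threading.SpinLock source. Any negative timeout is treated as "wait forever" to keep the sketch short.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical, simplified lock used only to illustrate the zero-timeout
// fast path; it is not the real System.Threading.SpinLock implementation.
public sealed class SketchSpinLock
{
    private const int Held = 1;       // bit 0: lock is held
    private const int OneWaiter = 2;  // waiters are counted in the upper bits
    private int _state;

    public int WaiterCount => Volatile.Read(ref _state) >> 1;

    public bool TryEnter(int millisecondsTimeout)
    {
        // One attempt to take the lock if it looks free.
        int observed = Volatile.Read(ref _state);
        if ((observed & Held) == 0 &&
            Interlocked.CompareExchange(ref _state, observed | Held, observed) == observed)
        {
            return true;
        }

        // Fail fast: with a zero timeout there is nothing to wait for, so
        // return before registering as a waiter. This skips the
        // Interlocked.Add and the later work to undo it.
        if (millisecondsTimeout == 0) return false;

        Interlocked.Add(ref _state, OneWaiter); // register as a waiter
        var spinner = new SpinWait();
        var elapsed = Stopwatch.StartNew();
        while (true)
        {
            observed = Volatile.Read(ref _state);
            // Acquire and deregister as a waiter in a single CAS.
            if ((observed & Held) == 0 &&
                Interlocked.CompareExchange(ref _state, (observed - OneWaiter) | Held, observed) == observed)
            {
                return true;
            }
            if (millisecondsTimeout > 0 && elapsed.ElapsedMilliseconds >= millisecondsTimeout)
            {
                Interlocked.Add(ref _state, -OneWaiter); // give up: deregister
                return false;
            }
            spinner.SpinOnce();
        }
    }

    // Clears bit 0; assumes the caller currently holds the lock.
    public void Exit() => Interlocked.Add(ref _state, -Held);
}

public static class FailFastDemo
{
    public static void Main()
    {
        var l = new SketchSpinLock();
        Console.WriteLine(l.TryEnter(0));  // True: the lock was free
        Console.WriteLine(l.TryEnter(0));  // False: held, returns immediately
        Console.WriteLine(l.WaiterCount);  // 0: never registered as a waiter
    }
}
```

The point of the early return is that the failing call performs no interlocked write at all, so the shared state stays untouched when the caller has said it is not willing to wait.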

Added a Thread.SpinWait(1) when the thread didn't yield.

Changed the spin calculation to do the same thing (fairly sure) but in fewer operations per iteration.

Changed the exit mechanism to use break rather than inline returns, as it generates less asm for popping the registers; it makes the code less clear though :-/

Improved handling for a high waiter count, though the count would need to be in the billions, so hopefully this would never be hit on a spinwait anyway (otherwise something has gone very wrong).

Moved yielding to the start of the yield loop, since if you've got there you've only just tried to acquire the lock.

Moved the time check after spinning into the if, since if you've skipped spinning you haven't really done anything yet, so there is no point checking whether you've timed out by >= 1 millisecond (and a zero timeout fast-paths at the start).

Changed the spin type thresholds to be powers of 2 and changed the %/idiv/mod to &. This means Sleep(1) moved from 40 -> 64 and Sleep(0) moved from 10 -> 16.
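
As a rough illustration of the threshold change, the sketch below compares a modulo-based policy with a mask-based one. The `BackOff` enum, the `SpinThresholds` class and the exact way the yield count selects Sleep(1), Sleep(0) or a plain yield are assumptions for illustration; only the numbers 40, 10, 64 and 16 come from the description above.

```csharp
using System;

public enum BackOff { Yield, Sleep0, Sleep1 }

// Hypothetical helper: how a yield counter might pick a back-off step.
public static class SpinThresholds
{
    // Old thresholds: needs a division (idiv) per check.
    public static BackOff WithModulo(int yields) =>
        yields % 40 == 0 ? BackOff.Sleep1 :
        yields % 10 == 0 ? BackOff.Sleep0 :
        BackOff.Yield;

    // New thresholds: powers of two, so "multiple of n" is a mask test.
    public static BackOff WithMask(int yields) =>
        (yields & 63) == 0 ? BackOff.Sleep1 :
        (yields & 15) == 0 ? BackOff.Sleep0 :
        BackOff.Yield;

    // For a power-of-two n and non-negative i, i % n == 0 exactly when
    // (i & (n - 1)) == 0.
    public static bool MaskMatchesModulo(int powerOfTwo, int upTo)
    {
        for (int i = 0; i <= upTo; i++)
        {
            if ((i % powerOfTwo == 0) != ((i & (powerOfTwo - 1)) == 0)) return false;
        }
        return true;
    }

    // Counts how often a policy picks the given step for yields 1..upTo.
    public static int Count(Func<int, BackOff> policy, BackOff kind, int upTo)
    {
        int n = 0;
        for (int i = 1; i <= upTo; i++)
        {
            if (policy(i) == kind) n++;
        }
        return n;
    }

    public static void Main()
    {
        Console.WriteLine(MaskMatchesModulo(16, 1000) && MaskMatchesModulo(64, 1000)); // True
        Console.WriteLine(Count(WithModulo, BackOff.Sleep1, 640)); // 16
        Console.WriteLine(Count(WithMask, BackOff.Sleep1, 640));   // 10
        Console.WriteLine(Count(WithModulo, BackOff.Sleep0, 640)); // 48
        Console.WriteLine(Count(WithMask, BackOff.Sleep0, 640));   // 30
    }
}
```

Under these assumptions the mask version sleeps less often over the same number of yields, which is the behavioural side effect of moving the thresholds from 40/10 to 64/16.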

Added comments throughout and corrected some strange spellings.

Trims the asm from 1107 bytes of instructions to 890, and JIT local vars from 31 to 24 (more loc, less tmp and less cse).

There are also indentation changes due to some of the rearrangement, so viewing the diff with &w=1 (ignore whitespace) is better for a compare.

Passes corefx tests

1M iterations (single thread, uncontended but already locked) with this setup:

bool lockTaken = false;
var s = new SpinLock(false);
s.Enter(ref lockTaken);
method                          pre (ms)   post (ms)   improvement
s.TryEnter(0, ref lockTaken)       24.55        5.95         x 4.1
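
A measurement like the one above could be reproduced with a loop along these lines. This is a sketch, not the harness actually used for the numbers in the table: the `TryEnterBench` class, the separate `taken` flag per call (SpinLock.TryEnter requires the flag to be false on entry) and the Stopwatch timing are all assumptions.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical micro-benchmark; timings depend on machine and runtime.
public static class TryEnterBench
{
    public static (long Failures, double Milliseconds) Run(int iterations)
    {
        bool lockTaken = false;
        var s = new SpinLock(false);  // no owner tracking, as in the setup above
        s.Enter(ref lockTaken);       // hold the lock so every TryEnter(0) fails

        long failures = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            bool taken = false;       // must be false before each TryEnter call
            s.TryEnter(0, ref taken);
            if (!taken) failures++;
        }
        sw.Stop();

        s.Exit();
        return (failures, sw.Elapsed.TotalMilliseconds);
    }

    // Improvement factor as reported in the table: pre time / post time.
    public static double Speedup(double preMs, double postMs) => preMs / postMs;

    public static void Main()
    {
        var result = Run(1_000_000);
        Console.WriteLine($"failed attempts: {result.Failures}, elapsed: {result.Milliseconds} ms");
        Console.WriteLine($"reported improvement: x {Speedup(24.55, 5.95):F1}");
    }
}
```

A single Stopwatch run like this is noisy; it only shows the shape of the measurement, with 24.55 / 5.95 giving the x 4.1 figure quoted in the table.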

Adding threadpool perf timings to gist and will post highlights

@stephentoub follow up on #6911

@benaadams
Member Author

benaadams commented Aug 26, 2016

Results using https://github.com/benaadams/ThreadPoolTaskTesting
results gist

Individual results vary due to thread timing and GC, and TrySteal uses a random factor, so the chart trends are more significant than any single number. (A bunch of different sections follow; I am grouping them together by type.)

SubTask Chain Return 
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
SubTask Chain Return      615.388 k     1.027 M     4.280 M     3.964 M     4.547 M
- Depth    2              548.552 k   540.726 k     3.375 M     3.340 M     4.557 M
- Depth   16              545.361 k     1.323 M     4.195 M     4.469 M     5.397 M
- Depth   64              247.245 k     1.787 M     3.118 M     4.696 M     5.510 M
- Depth  512              172.549 k   803.055 k     3.405 M     5.498 M     5.731 M

Post
SubTask Chain Return      702.488 k     1.629 M     4.319 M     4.314 M     4.336 M
- Depth    2              731.134 k     1.957 M     4.530 M     4.665 M     4.651 M
- Depth   16              975.378 k     2.413 M     4.836 M     5.498 M     5.534 M
- Depth   64              972.742 k     2.541 M     5.291 M     5.552 M     5.603 M
- Depth  512              989.954 k     2.578 M     5.277 M     5.585 M     5.684 M

Improvements across the board
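
Since individual cells are noisy, one way to read these tables is to turn each Pre/Post pair into a ratio. The sketch below does that for the top row of the table above; the `OpsTable` helper and its parsing of the "k"/"M" suffixes are assumptions for illustration, and the input strings are copied from the Pre and Post rows.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper for comparing Pre and Post rows of the tables.
public static class OpsTable
{
    // Parses a cell such as "615.388 k" or "1.027 M" into operations per second.
    public static double ParseOps(string cell)
    {
        var parts = cell.Trim().Split(' ');
        double value = double.Parse(parts[0], CultureInfo.InvariantCulture);
        return parts[1] == "M" ? value * 1e6 : value * 1e3;
    }

    // Post / Pre for each column; above 1.0 is an improvement.
    public static double[] Ratios(string[] pre, string[] post) =>
        pre.Zip(post, (a, b) => ParseOps(b) / ParseOps(a)).ToArray();

    public static void Main()
    {
        // "SubTask Chain Return" row, i5 4 core no HT (Serial, 2x, 16x, 64x, 512x).
        var pre = new[] { "615.388 k", "1.027 M", "4.280 M", "3.964 M", "4.547 M" };
        var post = new[] { "702.488 k", "1.629 M", "4.319 M", "4.314 M", "4.336 M" };

        foreach (var ratio in Ratios(pre, post))
        {
            Console.WriteLine(ratio.ToString("F2", CultureInfo.InvariantCulture));
        }
    }
}
```

Ratios close to 1.0 are probably within run-to-run variance, so the larger shifts at low parallelism are the more informative part of the comparison.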

i5 4 core no HT - 500 min thread
Pre
SubTask Chain Return      960.999 k     1.179 M     3.479 M     4.361 M     4.614 M
- Depth    2                1.108 M     1.281 M     3.736 M     4.268 M     4.941 M
- Depth   16                1.303 M     1.548 M     3.988 M     5.475 M     5.774 M
- Depth   64                1.385 M     1.575 M     4.064 M     5.335 M     5.815 M
- Depth  512                1.095 M     2.810 M     4.828 M     5.739 M     5.781 M

Post
SubTask Chain Return      904.786 k   990.460 k     3.251 M     4.159 M     4.386 M
- Depth    2                1.027 M     1.232 M     3.657 M     4.464 M     4.834 M
- Depth   16                1.232 M     1.475 M     3.901 M     5.385 M     5.763 M
- Depth   64                1.053 M     2.075 M     4.172 M     5.384 M     5.827 M
- Depth  512                1.175 M     2.851 M     4.735 M     5.590 M     5.749 M

Some regression

i7 4 core 8 HT
Pre
SubTask Chain Return      615.606 k   813.288 k     4.670 M     5.121 M     5.612 M
- Depth    2              696.861 k   871.519 k     5.813 M     5.705 M     5.986 M
- Depth   16              354.586 k   746.973 k     5.190 M     6.362 M     6.378 M
- Depth   64              435.186 k   778.422 k     5.410 M     5.533 M     6.383 M
- Depth  512              464.164 k   730.233 k     5.205 M     6.814 M     6.632 M

Post
SubTask Chain Return      604.834 k     1.133 M     5.359 M     5.641 M     5.724 M
- Depth    2              691.719 k     1.281 M     6.097 M     5.827 M     6.237 M
- Depth   16              724.503 k     1.617 M     6.529 M     6.639 M     6.847 M
- Depth   64              736.150 k     1.650 M     6.758 M     6.930 M     6.841 M
- Depth  512              824.113 k     1.765 M     6.943 M     6.858 M     6.964 M

Improvements across the board

i7 4 core 8 HT - 500 min thread
Pre
SubTask Chain Return      552.093 k   655.964 k     2.293 M     4.387 M     5.167 M
- Depth    2              597.881 k   699.831 k     3.099 M     5.356 M     5.658 M
- Depth   16              690.066 k   769.582 k     4.011 M     5.486 M     6.395 M
- Depth   64              749.703 k   800.779 k     3.956 M     5.607 M     6.452 M
- Depth  512              645.235 k     1.495 M     5.354 M     6.382 M     6.429 M

Post
SubTask Chain Return      618.259 k   619.092 k     2.446 M     4.590 M     5.327 M
- Depth    2              652.892 k   711.321 k     2.991 M     5.413 M     5.714 M
- Depth   16              741.617 k   827.076 k     3.781 M     5.636 M     6.414 M
- Depth   64              769.178 k   783.729 k     4.486 M     6.226 M     6.475 M
- Depth  512              687.981 k     1.434 M     6.388 M     6.554 M     6.634 M

General Improvements

@benaadams
Member Author

benaadams commented Aug 26, 2016

SubTask Chain Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
SubTask Chain Awaited     433.582 k   698.576 k     2.355 M     2.459 M     2.470 M
- Depth    2              447.282 k   754.095 k     2.417 M     2.626 M     2.730 M
- Depth   16              384.794 k   933.787 k     2.202 M     2.670 M     2.930 M
- Depth   64              453.963 k     1.106 M     2.340 M     2.778 M     2.870 M
- Depth  512              507.166 k   960.942 k     2.625 M     2.510 M     2.855 M

Post
SubTask Chain Awaited     472.009 k     1.241 M     2.416 M     2.618 M     2.664 M
- Depth    2              565.431 k     1.511 M     2.638 M     2.813 M     2.817 M
- Depth   16              792.927 k     1.761 M     3.002 M     3.317 M     3.271 M
- Depth   64              775.420 k     1.820 M     3.179 M     3.169 M     3.346 M
- Depth  512              873.529 k     1.876 M     2.920 M     3.179 M     3.107 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
SubTask Chain Awaited     603.821 k   702.970 k     1.794 M     2.377 M     2.729 M
- Depth    2              653.331 k   713.702 k     1.871 M     2.693 M     2.895 M
- Depth   16              739.070 k   797.390 k     2.197 M     2.793 M     3.229 M
- Depth   64              797.874 k     1.210 M     2.484 M     3.003 M     3.260 M
- Depth  512              758.383 k     1.546 M     2.777 M     2.971 M     2.941 M

Post
SubTask Chain Awaited     596.226 k   665.390 k     1.788 M     2.294 M     2.687 M
- Depth    2              651.716 k   696.489 k     1.873 M     2.602 M     2.909 M
- Depth   16              729.769 k   924.021 k     2.329 M     2.935 M     3.347 M
- Depth   64              807.734 k     1.292 M     2.528 M     3.076 M     3.345 M
- Depth  512              769.498 k     1.687 M     2.701 M     2.878 M     3.099 M

Mixed

i7 4 core 8 HT
Pre
SubTask Chain Awaited     520.123 k   647.306 k     3.040 M     3.299 M     3.626 M
- Depth    2              462.123 k   666.259 k     3.143 M     3.554 M     3.626 M
- Depth   16              413.802 k   934.982 k     3.192 M     3.777 M     4.064 M
- Depth   64              454.067 k   792.816 k     3.330 M     3.584 M     4.014 M
- Depth  512              486.897 k   696.634 k     2.782 M     3.272 M     3.624 M

Post
SubTask Chain Awaited     503.086 k   857.364 k     3.309 M     3.326 M     3.528 M
- Depth    2              514.673 k   920.604 k     3.225 M     3.573 M     3.697 M
- Depth   16              546.228 k     1.257 M     3.673 M     3.908 M     3.981 M
- Depth   64              607.220 k     1.280 M     3.774 M     3.968 M     4.008 M
- Depth  512              709.936 k     1.463 M     3.235 M     3.242 M     3.478 M

Improvements across the board

i7 4 core 8 HT - 500 min thread
Pre
SubTask Chain Awaited     379.956 k   418.250 k     1.942 M     2.399 M     3.190 M
- Depth    2              397.497 k   436.692 k     1.760 M     2.355 M     3.455 M
- Depth   16              454.366 k   484.191 k     1.341 M     2.880 M     3.757 M
- Depth   64              449.875 k   889.808 k     1.931 M     3.119 M     3.819 M
- Depth  512              424.484 k   912.257 k     2.365 M     2.933 M     3.209 M

Post
SubTask Chain Awaited     382.641 k   398.446 k     2.056 M     2.400 M     3.197 M
- Depth    2              404.685 k   447.393 k     2.037 M     2.570 M     3.516 M
- Depth   16              462.104 k   490.836 k     1.285 M     2.884 M     3.795 M
- Depth   64              451.043 k   899.644 k     1.904 M     3.112 M     3.822 M
- Depth  512              429.539 k   946.724 k     2.314 M     2.982 M     3.358 M

Similar

@benaadams
Member Author

benaadams commented Aug 27, 2016

SubTask Fanout Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
SubTask Fanout Awaited    244.211 k   568.096 k     1.384 M     1.404 M     1.399 M
- Depth    2              522.159 k     1.053 M     1.722 M     1.722 M     1.735 M
- Depth   16                1.330 M     2.137 M     2.342 M     2.412 M     2.343 M
- Depth   64                1.789 M     2.281 M     2.487 M     2.507 M     2.488 M
- Depth  512                1.923 M     2.353 M     2.557 M     2.556 M     2.497 M

Post
SubTask Fanout Awaited    287.239 k   816.204 k     1.455 M     1.472 M     1.478 M
- Depth    2              598.324 k     1.310 M     1.832 M     1.863 M     1.856 M
- Depth   16                1.536 M     2.271 M     2.488 M     2.547 M     2.464 M
- Depth   64                1.856 M     2.472 M     2.686 M     2.679 M     2.690 M
- Depth  512                2.048 M     2.528 M     2.715 M     2.769 M     2.711 M

Some Improvement

i5 4 core no HT - 500 min thread
Pre
SubTask Fanout Awaited    346.810 k   380.862 k   953.233 k     1.236 M     1.453 M
- Depth    2              497.513 k   521.983 k     1.272 M     1.643 M     1.842 M
- Depth   16              906.260 k     1.209 M     2.179 M     2.458 M     2.404 M
- Depth   64                1.088 M     1.250 M     2.291 M     2.567 M     2.602 M
- Depth  512                1.121 M     1.516 M     2.430 M     2.605 M     2.593 M

Post
SubTask Fanout Awaited    346.806 k   382.608 k   833.971 k     1.251 M     1.400 M
- Depth    2              496.223 k   558.188 k     1.250 M     1.675 M     1.844 M
- Depth   16              931.701 k     1.165 M     2.191 M     2.485 M     2.483 M
- Depth   64                1.129 M     1.394 M     2.357 M     2.627 M     2.590 M
- Depth  512                1.184 M     1.450 M     2.441 M     2.672 M     2.619 M

Similar

i7 4 core 8 HT
Pre
SubTask Fanout Awaited    252.416 k   385.534 k     1.831 M     2.062 M     2.167 M
- Depth    2              395.468 k   844.312 k     2.556 M     2.589 M     2.632 M
- Depth   16                1.369 M     2.439 M     3.316 M     3.377 M     3.315 M
- Depth   64                1.763 M     2.797 M     3.457 M     3.510 M     3.420 M
- Depth  512                2.014 M     2.898 M     3.498 M     3.496 M     3.522 M

Post
SubTask Fanout Awaited    280.779 k   516.919 k     2.079 M     2.089 M     2.090 M
- Depth    2              418.349 k     1.039 M     2.415 M     2.519 M     2.543 M
- Depth   16                1.382 M     2.456 M     3.164 M     3.247 M     3.182 M
- Depth   64                1.872 M     2.783 M     3.271 M     3.317 M     3.284 M
- Depth  512                2.147 M     2.899 M     3.420 M     3.355 M     3.373 M

Mixed

i7 4 core 8 HT - 500 min thread
Pre
SubTask Fanout Awaited    229.300 k   240.043 k   839.135 k     1.444 M     1.925 M
- Depth    2              342.694 k   367.394 k     1.381 M     1.758 M     2.444 M
- Depth   16              683.237 k   750.378 k     2.494 M     2.881 M     3.200 M
- Depth   64              837.230 k   877.774 k     2.660 M     3.043 M     3.234 M
- Depth  512              887.270 k   912.580 k     2.848 M     2.973 M     3.092 M

Post
SubTask Fanout Awaited    238.340 k   257.873 k   900.434 k     1.510 M     1.946 M
- Depth    2              355.021 k   386.566 k     1.330 M     1.905 M     2.453 M
- Depth   16              687.540 k   750.731 k     2.414 M     2.978 M     3.074 M
- Depth   64              797.330 k   951.923 k     2.581 M     3.058 M     3.229 M
- Depth  512              877.081 k     1.113 M     2.784 M     3.036 M     3.180 M

Similar

@benaadams
Member Author

benaadams commented Aug 27, 2016

Continuation Chain
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
Continuation Chain        150.844 k   396.001 k     1.376 M     1.445 M     1.489 M
- Depth    2              349.510 k   881.096 k     2.158 M     2.603 M     2.602 M
- Depth   16              806.582 k     2.819 M     6.803 M     7.105 M     7.040 M
- Depth   64                1.246 M     4.002 M     8.628 M     8.530 M     8.385 M
- Depth  512                1.139 M     4.659 M     8.354 M     8.039 M     7.982 M

Post
Continuation Chain        240.058 k   644.372 k     1.574 M     1.603 M     1.592 M
- Depth    2              395.840 k     1.098 M     2.767 M     2.743 M     2.753 M
- Depth   16              941.757 k     3.001 M     7.380 M     7.372 M     7.311 M
- Depth   64                1.121 M     3.823 M     8.920 M     8.903 M     8.787 M
- Depth  512                1.215 M     4.120 M     9.444 M     9.368 M     9.309 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
Continuation Chain        364.895 k   393.347 k   999.829 k     1.378 M     1.589 M
- Depth    2              609.613 k   665.320 k     2.180 M     2.565 M     2.754 M
- Depth   16                1.643 M     1.885 M     7.073 M     7.191 M     7.208 M
- Depth   64                2.159 M     2.604 M     8.177 M     8.539 M     8.534 M
- Depth  512                2.619 M     3.775 M     8.871 M     9.109 M     9.006 M

Post
Continuation Chain        371.949 k   402.795 k     1.008 M     1.358 M     1.599 M
- Depth    2              623.992 k   677.816 k     2.133 M     2.584 M     2.729 M
- Depth   16                1.662 M     1.931 M     7.047 M     7.287 M     7.277 M
- Depth   64                2.166 M     2.466 M     7.913 M     8.679 M     8.651 M
- Depth  512                2.672 M     3.365 M     9.000 M     9.216 M     9.143 M

Mixed/Improve

i7 4 core 8 HT
Pre
Continuation Chain        180.548 k   251.513 k     2.131 M     2.195 M     2.188 M
- Depth    2              297.147 k   421.529 k     2.925 M     3.684 M     3.681 M
- Depth   16              667.285 k     1.060 M     8.785 M     8.705 M     8.659 M
- Depth   64              801.636 k     1.205 M    10.172 M    10.028 M    10.014 M
- Depth  512              815.435 k     1.323 M     8.936 M    10.179 M     9.822 M

Post
Continuation Chain        197.906 k   413.404 k     2.048 M     2.080 M     2.129 M
- Depth    2              320.715 k   685.434 k     3.586 M     3.496 M     3.549 M
- Depth   16              740.831 k     1.540 M     8.183 M     8.144 M     8.146 M
- Depth   64              870.924 k     1.818 M     9.605 M     9.576 M     9.531 M
- Depth  512              916.890 k     1.912 M     9.919 M     9.955 M     9.755 M

Mostly Improved

i7 4 core 8 HT - 500 min thread
Pre
Continuation Chain        232.648 k   242.818 k   744.910 k     1.554 M     2.030 M
- Depth    2              383.585 k   399.078 k     1.434 M     2.808 M     3.432 M
- Depth   16              961.601 k     1.051 M     7.114 M     8.001 M     6.631 M
- Depth   64                1.255 M     1.441 M     7.812 M     9.026 M     9.600 M
- Depth  512                1.430 M     1.620 M     9.122 M     9.506 M     9.533 M

Post
Continuation Chain        241.044 k   260.005 k   923.966 k     1.604 M     2.068 M
- Depth    2              387.877 k   422.354 k     1.600 M     2.889 M     3.449 M
- Depth   16              964.811 k     1.063 M     5.804 M     7.894 M     8.224 M
- Depth   64                1.247 M     1.497 M     7.797 M     9.059 M     9.435 M
- Depth  512                1.440 M     1.360 M     9.178 M     9.643 M     9.509 M

Mostly Improved

@benaadams
Member Author

benaadams commented Aug 27, 2016

Continuation Fanout
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
Continuation Fanout       133.352 k   320.861 k   989.129 k     1.120 M     1.122 M
- Depth    2              202.924 k   631.553 k     1.605 M     1.741 M     1.805 M
- Depth   16              937.335 k     1.853 M     4.320 M     4.300 M     4.321 M
- Depth   64                1.845 M     2.452 M     5.397 M     5.258 M     5.210 M
- Depth  512                2.742 M     3.868 M     5.554 M     5.733 M     5.528 M

Post
Continuation Fanout       196.591 k   558.073 k     1.134 M     1.148 M     1.153 M
- Depth    2              395.617 k     1.006 M     1.784 M     1.786 M     1.835 M
- Depth   16                1.974 M     3.629 M     4.398 M     4.346 M     4.323 M
- Depth   64                3.368 M     5.087 M     5.390 M     5.388 M     5.313 M
- Depth  512                3.892 M     5.612 M     5.616 M     5.650 M     5.679 M

Improvements across the board

i5 4 core no HT - 500 min thread
Pre
Continuation Fanout       265.336 k   292.083 k   553.332 k   934.454 k     1.145 M
- Depth    2              414.743 k   442.279 k     1.111 M     1.482 M     1.686 M
- Depth   16                1.184 M     1.555 M     3.781 M     4.061 M     4.146 M
- Depth   64                1.697 M     2.107 M     4.965 M     5.097 M     5.156 M
- Depth  512                1.639 M     2.521 M     5.220 M     5.438 M     5.410 M

Post
Continuation Fanout       259.659 k   283.807 k   604.670 k   910.191 k     1.120 M
- Depth    2              408.807 k   439.765 k     1.171 M     1.534 M     1.658 M
- Depth   16                1.162 M     1.525 M     3.936 M     4.244 M     4.193 M
- Depth   64                1.827 M     2.292 M     5.068 M     5.166 M     5.208 M
- Depth  512                1.777 M     2.850 M     5.278 M     5.101 M     5.470 M

Generally improved

i7 4 core 8 HT
Pre
Continuation Fanout       167.245 k   242.750 k     1.541 M     1.589 M     1.633 M
- Depth    2              189.065 k   411.591 k     2.151 M     2.442 M     2.546 M
- Depth   16                1.162 M     2.604 M     5.645 M     5.608 M     5.620 M
- Depth   64                1.992 M     4.694 M     6.578 M     6.514 M     6.515 M
- Depth  512                1.960 M     4.778 M     6.679 M     6.856 M     6.914 M

Post
Continuation Fanout       166.887 k   342.927 k     1.525 M     1.555 M     1.588 M
- Depth    2              279.829 k   676.619 k     2.276 M     2.369 M     2.422 M
- Depth   16                1.647 M     3.597 M     5.266 M     5.263 M     5.205 M
- Depth   64                1.899 M     4.429 M     6.073 M     6.068 M     6.041 M
- Depth  512                1.826 M     4.462 M     6.179 M     6.297 M     6.352 M

Some regression

i7 4 core 8 HT - 500 min thread
Pre
Continuation Fanout       176.889 k   189.812 k   633.173 k     1.062 M     1.457 M
- Depth    2              262.854 k   282.592 k     1.067 M     1.631 M     2.218 M
- Depth   16              871.006 k   888.298 k     4.070 M     4.947 M     5.228 M
- Depth   64                1.079 M     1.320 M     4.848 M     3.746 M     6.045 M
- Depth  512                1.029 M     1.168 M     5.116 M     6.286 M     5.959 M

Post
Continuation Fanout       178.117 k   190.339 k   533.092 k     1.079 M     1.481 M
- Depth    2              274.058 k   276.105 k     1.379 M     1.863 M     2.271 M
- Depth   16              837.715 k   988.226 k     4.059 M     4.942 M     5.195 M
- Depth   64                1.063 M     1.163 M     5.286 M     5.815 M     6.008 M
- Depth  512              964.998 k     1.463 M     3.813 M     5.905 M     5.899 M

Some regression

@benaadams
Member Author

benaadams commented Aug 27, 2016

Yield Chain Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
Yield Chain Awaited       726.850 k     1.657 M     4.014 M     4.041 M     4.072 M
- Depth    2              904.511 k     2.197 M     4.601 M     4.617 M     4.921 M
- Depth   16                1.316 M     4.273 M     5.977 M     6.086 M     5.955 M
- Depth   64                2.305 M     4.647 M     6.198 M     6.120 M     5.847 M
- Depth  512                2.646 M     5.246 M     6.135 M     5.922 M     4.847 M

Post
Yield Chain Awaited       734.307 k     1.774 M     4.025 M     4.005 M     4.078 M
- Depth    2                1.073 M     2.604 M     4.519 M     4.786 M     4.910 M
- Depth   16                2.008 M     4.589 M     5.910 M     5.947 M     5.968 M
- Depth   64                2.392 M     5.033 M     6.206 M     6.054 M     5.867 M
- Depth  512                2.520 M     5.350 M     6.130 M     5.859 M     4.844 M

Similar

i5 4 core no HT - 500 min thread
Pre
Yield Chain Awaited       726.850 k     1.657 M     4.014 M     4.041 M     4.072 M
- Depth    2              904.511 k     2.197 M     4.601 M     4.617 M     4.921 M
- Depth   16                1.316 M     4.273 M     5.977 M     6.086 M     5.955 M
- Depth   64                2.305 M     4.647 M     6.198 M     6.120 M     5.847 M
- Depth  512                2.646 M     5.246 M     6.135 M     5.922 M     4.847 M

Post
Yield Chain Awaited       734.307 k     1.774 M     4.025 M     4.005 M     4.078 M
- Depth    2                1.073 M     2.604 M     4.519 M     4.786 M     4.910 M
- Depth   16                2.008 M     4.589 M     5.910 M     5.947 M     5.968 M
- Depth   64                2.392 M     5.033 M     6.206 M     6.054 M     5.867 M
- Depth  512                2.520 M     5.350 M     6.130 M     5.859 M     4.844 M

Similar

i7 4 core 8 HT
Pre
Yield Chain Awaited       742.757 k     1.196 M     5.133 M     5.180 M     6.125 M
- Depth    2              895.473 k     1.750 M     4.453 M     5.342 M     7.016 M
- Depth   16                1.270 M     2.632 M     5.751 M     7.500 M     7.589 M
- Depth   64                1.389 M     3.062 M     7.408 M     7.289 M     7.189 M
- Depth  512                1.360 M     3.220 M     7.098 M     6.938 M     6.097 M

Post
Yield Chain Awaited       716.754 k     1.170 M     4.969 M     5.012 M     5.776 M
- Depth    2              864.528 k     1.750 M     4.586 M     5.095 M     6.584 M
- Depth   16                1.295 M     2.941 M     5.971 M     7.078 M     7.306 M
- Depth   64                1.324 M     3.187 M     6.644 M     6.965 M     6.947 M
- Depth  512                1.328 M     3.175 M     6.922 M     6.764 M     5.780 M

Similar

i7 4 core 8 HT - 500 min thread
Pre
Yield Chain Awaited       740.148 k   820.592 k     3.253 M     4.770 M     5.765 M
- Depth    2              885.823 k   934.844 k     3.401 M     5.006 M     6.691 M
- Depth   16                1.237 M     1.374 M     4.895 M     7.197 M     7.401 M
- Depth   64                1.763 M     2.115 M     7.189 M     7.115 M     6.289 M
- Depth  512                1.847 M     2.471 M     6.297 M     6.811 M     4.975 M

Post
Yield Chain Awaited       751.905 k   817.659 k     3.333 M     4.805 M     5.723 M
- Depth    2              892.455 k   965.113 k     3.250 M     4.981 M     6.758 M
- Depth   16                1.209 M     1.429 M     4.837 M     7.090 M     7.384 M
- Depth   64                1.763 M     2.034 M     6.819 M     6.962 M     6.982 M
- Depth  512                1.840 M     2.363 M     6.828 M     6.508 M     4.738 M

Similar

@benaadams
Member Author

benaadams commented Aug 27, 2016

Async Chain Awaited
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
Async Chain Awaited       667.010 k     1.066 M     2.698 M     3.290 M     3.427 M
- Depth    2                1.079 M     1.707 M     4.558 M     5.069 M     5.155 M
- Depth   16                2.025 M     4.175 M     7.862 M     8.132 M     8.082 M
- Depth   64                2.121 M     4.281 M     8.181 M     8.436 M     8.548 M
- Depth  512                2.224 M     4.229 M     6.627 M     6.726 M     8.136 M

Post
Async Chain Awaited       675.204 k     1.294 M     3.245 M     3.371 M     3.357 M
- Depth    2                1.053 M     1.957 M     4.487 M     4.737 M     4.955 M
- Depth   16                2.086 M     4.200 M     7.830 M     8.046 M     8.028 M
- Depth   64                2.121 M     4.311 M     8.262 M     8.673 M     8.622 M
- Depth  512                2.263 M     4.557 M     6.806 M     7.068 M     8.276 M

Slight improvement

i5 4 core no HT - 500 min thread
Pre
Async Chain Awaited       814.231 k   971.207 k     2.639 M     3.163 M     3.255 M
- Depth    2                1.178 M     1.366 M     3.844 M     4.842 M     5.055 M
- Depth   16                1.900 M     2.462 M     5.705 M     7.624 M     7.868 M
- Depth   64                2.091 M     4.300 M     7.063 M     8.118 M     8.244 M
- Depth  512                2.096 M     3.948 M     6.362 M     6.438 M     8.273 M

Post
Async Chain Awaited       820.718 k   930.799 k     2.407 M     3.241 M     3.443 M
- Depth    2                1.170 M     1.308 M     3.640 M     4.187 M     4.974 M
- Depth   16                1.885 M     2.278 M     6.228 M     7.428 M     7.967 M
- Depth   64                2.048 M     3.987 M     7.209 M     8.235 M     8.288 M
- Depth  512                2.098 M     4.025 M     6.611 M     6.647 M     8.018 M

Slight improvement

i7 4 core 8 HT
Pre
Async Chain Awaited       550.323 k   805.145 k     4.696 M     4.723 M     4.448 M
- Depth    2                1.222 M     1.511 M     6.661 M     6.695 M     6.721 M
- Depth   16                2.122 M     3.232 M    10.887 M    10.924 M    10.872 M
- Depth   64                2.626 M     3.743 M    11.674 M    11.688 M    11.672 M
- Depth  512                2.487 M     4.818 M    10.701 M    10.938 M    11.753 M

Post
Async Chain Awaited       550.662 k   923.651 k     4.446 M     4.398 M     4.502 M
- Depth    2                1.241 M     1.455 M     5.934 M     6.400 M     6.447 M
- Depth   16                2.172 M     3.229 M    10.462 M    10.646 M    10.709 M
- Depth   64                2.763 M     3.812 M    11.581 M    11.515 M    11.416 M
- Depth  512                2.499 M     4.697 M    10.836 M    10.934 M    11.635 M

Similar

i7 4 core 8 HT - 500 min thread
Pre
Async Chain Awaited       498.238 k   557.739 k     1.449 M     3.662 M     4.255 M
- Depth    2              687.844 k   719.451 k     3.449 M     5.298 M     6.306 M
- Depth   16                1.407 M     2.447 M    10.275 M    10.227 M    10.183 M
- Depth   64                2.388 M     3.425 M    10.658 M    11.123 M    10.963 M
- Depth  512                2.061 M     3.709 M    10.247 M    10.340 M    11.292 M

Post
Async Chain Awaited       525.209 k   485.608 k     1.659 M     2.565 M     4.148 M
- Depth    2              728.118 k   784.423 k     3.815 M     5.217 M     6.249 M
- Depth   16                1.411 M     2.478 M     9.182 M     9.489 M    10.042 M
- Depth   64                2.404 M     3.400 M    10.544 M    10.904 M    11.009 M
- Depth  512                2.086 M     3.764 M    10.349 M    10.510 M    11.325 M

Similar

@benaadams
Member Author

benaadams commented Aug 27, 2016

QUWI Local Queues
Testing 2,621,440 calls, with GCs after 262,144 calls.
Operations per second
                                                                           Parallelism
                             Serial          2x         16x         64x        512x
i5 4 core no HT
Pre
QUWI Local Queues           9.919 M    10.878 M    11.594 M    11.769 M    12.569 M
- Depth    2                9.733 M     8.484 M     9.122 M     9.175 M     9.773 M
- Depth   16               10.259 M     9.908 M    10.153 M    10.251 M    10.111 M
- Depth   64               10.363 M    10.330 M    10.390 M    10.338 M    10.208 M
- Depth  512               10.390 M    10.358 M    10.417 M    10.313 M    10.399 M

Post
QUWI Local Queues          11.641 M    11.216 M    10.038 M    11.962 M    12.615 M
- Depth    2                9.851 M     8.609 M     9.129 M     9.589 M     9.479 M
- Depth   16               10.448 M    10.226 M    10.272 M    10.286 M    10.195 M
- Depth   64               10.410 M    10.358 M    10.319 M    10.491 M    10.393 M
- Depth  512               10.492 M    10.340 M    10.559 M    10.576 M    10.469 M

Generally improved

i5 4 core no HT - 500 min thread
Pre
QUWI Local Queues          12.616 M    11.392 M     9.527 M    11.639 M    12.307 M
- Depth    2                9.701 M     7.792 M     8.577 M     8.922 M     8.949 M
- Depth   16               10.014 M     9.941 M    10.122 M    10.106 M     8.767 M
- Depth   64               10.348 M    10.159 M    10.361 M    10.379 M    10.243 M
- Depth  512               10.506 M     9.424 M    10.246 M    10.377 M    10.432 M

Post
QUWI Local Queues          12.793 M    11.522 M    10.079 M    11.804 M    11.943 M
- Depth    2                9.877 M     8.447 M     8.950 M     9.430 M     8.963 M
- Depth   16               10.368 M    10.017 M    10.176 M    10.311 M    10.244 M
- Depth   64               10.311 M    10.502 M    10.431 M    10.472 M    10.334 M
- Depth  512               10.503 M    10.562 M    10.420 M    10.490 M    10.463 M

Generally improved

i7 4 core 8 HT
Pre
QUWI Local Queues           4.251 M     4.817 M     6.366 M     7.812 M     8.858 M
- Depth    2                4.504 M     5.076 M     5.963 M     7.407 M     8.038 M
- Depth   16                6.782 M     6.557 M     6.634 M     6.704 M     6.733 M
- Depth   64                6.767 M     6.618 M     6.809 M     6.769 M     6.827 M
- Depth  512                6.814 M     6.818 M     6.713 M     6.838 M     6.611 M

Post
QUWI Local Queues           3.932 M     4.810 M     5.890 M     8.106 M     8.346 M
- Depth    2                4.928 M     4.820 M     6.101 M     7.347 M     7.784 M
- Depth   16                6.756 M     6.596 M     6.554 M     6.729 M     6.624 M
- Depth   64                6.826 M     6.696 M     6.667 M     6.740 M     6.820 M
- Depth  512                6.765 M     6.799 M     6.663 M     6.723 M     6.725 M

Slight regression

i7 4 core 8 HT - 500 min thread
Pre
QUWI Local Queues           7.620 M     7.933 M     7.522 M     7.236 M     7.932 M
- Depth    2                7.036 M     7.566 M     6.363 M     7.225 M     6.275 M
- Depth   16                6.763 M     6.621 M     6.540 M     6.609 M     6.454 M
- Depth   64                6.736 M     6.719 M     6.671 M     6.699 M     6.725 M
- Depth  512                6.782 M     6.668 M     6.611 M     6.633 M     6.661 M

Post
QUWI Local Queues           8.912 M     8.746 M     6.623 M     7.563 M     8.078 M
- Depth    2                7.468 M     7.559 M     6.216 M     7.318 M     7.141 M
- Depth   16                6.645 M     6.554 M     6.501 M     6.686 M     6.607 M
- Depth   64                6.723 M     6.706 M     6.666 M     6.707 M     6.657 M
- Depth  512                6.766 M     6.517 M     6.552 M     6.635 M     6.477 M

Mixed

@benaadams
Member Author

benaadams commented Aug 27, 2016

Added some impressions to the before and after results above for the effects on the thread pool; they are by eyeball, so take them with a pinch of salt.

Overall I think this is an improvement to that also.

Still haven't found what heavily impacts QUWI performance on HT (see the last two sets of results; the second CPU is more powerful, but also has HT); my quest continues...


// After how many yields, check the timeout
private const int TIMEOUT_CHECK_FREQUENCY = 10;
private const int TIMEOUT_CHECK_FREQUENCY_MASK = 16;
Member


I'm a little concerned about these changes. I don't remember how much effort went into selecting the values initially, but these could have a real impact on usage, and issues that arise from such changes could be difficult to spot from limited microbenchmark-based testing. Lots of factors impact this, including number of cores, layout of cores, usage patterns, etc.
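To make the concern concrete, here is a minimal sketch of how moving a check threshold from 10 to a power of two changes how often the check fires. It assumes the check has the shape "yield count modulo frequency equals zero", which is not shown in this thread; the class and method names are hypothetical, and the mask is written as `powerOfTwo - 1` for the sketch rather than taken from the PR's constant.

```csharp
using System;

static class TimeoutCheckCadence
{
    // Modulo-style check (assumed shape): fires when the yield count is a multiple of `frequency`.
    public static bool DueByModulo(int turn, int frequency) => turn % frequency == 0;

    // Mask-style check: matches the modulo result only when `powerOfTwo` is a power of two
    // and `turn` is non-negative.
    public static bool DueByMask(int turn, int powerOfTwo) => (turn & (powerOfTwo - 1)) == 0;

    // Counts how many times a check fires over yields 1..turns.
    public static int CountChecks(int turns, Func<int, bool> due)
    {
        int checks = 0;
        for (int turn = 1; turn <= turns; turn++)
        {
            if (due(turn)) checks++;
        }
        return checks;
    }

    static void Main()
    {
        const int turns = 160;
        int every10 = CountChecks(turns, t => DueByModulo(t, 10));
        int every16 = CountChecks(turns, t => DueByMask(t, 16));
        // prints "checks in 160 yields: every 10 -> 16, every 16 -> 10"
        Console.WriteLine($"checks in {turns} yields: every 10 -> {every10}, every 16 -> {every16}");
    }
}
```

The mask avoids a division per iteration, but the cadence itself shifts, which is the kind of behavioral change that a microbenchmark may not surface.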

@stephentoub
Member

stephentoub commented Aug 27, 2016

@benaadams, thanks for the obvious effort you've put into this. I have to say, though, I started looking through it, and I'm feeling uneasy about this change. There's a lot that's rolled up into it, when the initial goal here was just removing the bulk of the additional work when a timeout of 0 was provided. The numbers you shared for throughput improvement in that case don't seem significantly different between the initial measurements, from when this was just a few lines changed, and now, when there are several hundred lines changed. I get nervous when such a low-level, threading-related type is changed in this manner. What's the bare minimum change necessary to achieve the bulk of the benefits? Other incremental changes could be considered on their own after that; I would prefer not to roll all such changes together.

cc: @kouvel, @ericeil

@benaadams
Member Author

The minimal change would look something like #6952, though it still does extra work, such as checking whether the timeout has passed after a single CAS/Increment. That is unlikely to take >= 1ms, which is the minimum elapsed time for the timeout check to fail.
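For reference, the case being optimized can be sketched as below: a `TryEnter` call with a timeout of 0 against a lock that is already held, which should report failure without waiting. This only illustrates the public `SpinLock` API from the caller's side, not the internals touched by either PR; the class and method names are hypothetical.

```csharp
using System;
using System.Threading;

static class TryEnterZeroTimeout
{
    // Returns whether TryEnter(0, ...) acquired a lock that is already held.
    public static bool TryEnterWhileHeld()
    {
        // false = no thread-owner tracking, so a second attempt on the same thread
        // reports failure instead of throwing LockRecursionException.
        var spinLock = new SpinLock(false);
        bool held = false;
        spinLock.Enter(ref held);
        try
        {
            bool lockTaken = false;
            spinLock.TryEnter(0, ref lockTaken); // timeout 0: try once, do not wait
            return lockTaken;
        }
        finally
        {
            if (held) spinLock.Exit();
        }
    }

    // Returns whether TryEnter(0, ...) acquired a lock that nobody holds.
    public static bool TryEnterWhileFree()
    {
        var spinLock = new SpinLock(false);
        bool lockTaken = false;
        spinLock.TryEnter(0, ref lockTaken);
        if (lockTaken) spinLock.Exit();
        return lockTaken;
    }

    static void Main()
    {
        Console.WriteLine($"held lock acquired: {TryEnterWhileHeld()}");
        Console.WriteLine($"free lock acquired: {TryEnterWhileFree()}");
    }
}
```

With owner tracking disabled, the held case mirrors the "single thread, uncontended but locked" setup described in the PR summary.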

@stephentoub
Member

though it still is doing extra

And how does the throughput improvement from that for TryEnter(0, ...) compare to all of these changes?

@benaadams
Member Author

Probably close... added second change for overchecking the timeout.

Will run tests, though I imagine most of the gains were from the fail fast path.



3 participants