SpinLock.TryEnter fail fast for timeout 0#6944
Conversation
|
Results using https://github.com/benaadams/ThreadPoolTaskTesting Individual result variances due to threading timing and GC; also Improvements across the board Some regression Improvements across the board General Improvements |
Improvements across the board Mixed Improvements across the board Similar |
Some Improvement Similar Mixed Similar |
Improvements across the board Mixed/Improve Mostly Improved Mostly Improved |
Improvements across the board Generally improved Some regression Some regression |
Similar Similar Similar Similar |
Slight improvement Slight improvement Similar Similar |
Generally improved Generally improved Slight regression Mixed |
|
Added some impressions to the before and after for the effects on threadpool; by eyeball so take with a pinch of salt. Overall I think this is an improvement to that also. Still haven't found what heavily impacts QUWI performance on HT (last set of results, second cpu is more powerful, but also HT); my quest continues... |
|
|
||
| // After how many yields, check the timeout | ||
| private const int TIMEOUT_CHECK_FREQUENCY = 10; | ||
| private const int TIMEOUT_CHECK_FREQUENCY_MASK = 16; |
There was a problem hiding this comment.
I'm a little concerned about these changes. I don't remember how much effort went into selecting the values initially, but these could have a real impact on usage, and issues that arise from such changes could be difficult to spot from limited microbenchmark-based testing. Lots of factors impact this, including number of cores, layout of cores, usage patterns, etc.
|
@benaadams, thanks for the obvious effort you've put into this. I have to say, though, I started looking through it, and I'm feeling uneasy about this change. There's a lot that's rolled up into it, when the initial goal was here was just around removing the bulk of the additional work when a timeout of 0 was provided. The numbers you shared for throughput improvement in that case don't seem significantly different between the initial measurements from when this was just a few lines changed to now when there's several hundred lines changed. I get nervous when such a low-level, threading-related type is changed in this manner. What's the bare minimum change necessary to achieve the bulk of the benefits? Other incremental changes could be considered on their own after that; I would prefer not to roll all such changes together. |
|
Min changes would look something like #6952 though it still is doing extra work like checking if the timeout has passed after a single CAS/Increment which is unlikely to take >= 1ms which would be the min value for the test to fail. |
And how does the throughput improvement from that for TryEnter(0, ...) compare to all of these changes? |
|
Probably close... added second change for overchecking the timeout. Will run tests, though I imagine most of the gains were from the fail fast path. |
Previously the timeout 0 would Interlocked.Add to set the waiters then CAS spin to unset it immediately after; now it exits before trying to set the waiters so skips both.
Added a
Thread.SpinWait(1)when the thread didn't yield.Changed the spin calculation to do the same thing (fairly sure) but in less operations per iteration.
Changed the exit mechanism to use
breakrather than inline returns as it generates less asm, for popping the registers - makes it less clear though :-/Improved handling for high waiter count; though it would be in the billion range so hopefully would never be hit on a spinwait anyway (otherwise something has gone very wrong).
Moved yielding to the start of the yield loop; as if you've got there you've already just tried to acquire the lock.
Moved the time check after spinning into the if; as if you've skipped spinning you haven't really done anything yet so no point in checking if timed out by >= 1 millisecond (and 0 zero timeout fast-paths at the start).
Changed the spin type thresholds to be powers of 2 and changed the
%/idiv/modto&. Which means: Sleep(1) moved from 40 -> 64; Sleep(0) moved from 10 -> 16Added comments throughout, also corrected some strange spellings.
Trims the asm from 1107 bytes of instructions to 890; and jit local vars from 31 to 24 (more loc, less tmp and less cse)
Also indentation changes due to some of the rearrangement so &w=1 is better for a compare.
Passes corefx tests
1M iters (single thread, uncontended but locked) for code
Adding threadpool perf timings to gist and will post highlights
@stephentoub follow up on #6911