SemaphoreSlim performance improvement #137
Conversation
|
@ikopylov, thank you very much for your contribution. I have not had a chance to look at your change yet. I'll take a look and reply back with comments. Thanks. |
|
Unfortunately this tweak doesn't solve the problem for all configurations. I had an opportunity to test it on a 24-core CPU (2x Intel Xeon X5675), and here are the results: With tweak: *Because of the slow execution I've reduced the number of iterations by a factor of 10. As you can see, the updated version of SemaphoreSlim performs worse than the original in this environment. Now it's obvious that this issue is not caused by the number of threads being greater than the number of cores. But there is still definitely some problem with the implementation of SemaphoreSlim. I tried to wrap the Wait call in an additional lock, and it began to work as it should (OMG):

while (!srcCancel.IsCancellationRequested)
{
    lock (waiterSyncObj)
        sem.Wait(myToken);
    Thread.SpinWait(takeSpin);
}

With that change I have the following results: Now I think that the problem is caused by the call to Monitor.PulseAll. It leads to the awakening of all waiting threads, but only a few of them exit the lock; all the others go back to the Wait state. |
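The false-wakeup effect described above can be reproduced outside of .NET. Below is a minimal sketch in Python, where threading.Condition plays the role of Monitor (notify_all is the analogue of Monitor.PulseAll, and notify(n) the analogue of n Monitor.Pulse calls); all class and variable names are illustrative, not taken from the SemaphoreSlim source:

```python
import threading
import time

class CountingSemaphore:
    """Toy semaphore mirroring SemaphoreSlim's lock + Monitor scheme.
    false_wakeups counts woken waiters that found no permit and had to
    go back to sleep. Names are illustrative, not from the real source."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0          # available permits
        self._waiting = 0        # threads currently blocked in wait()
        self.false_wakeups = 0

    def wait(self):
        with self._cond:
            self._waiting += 1
            while self._count == 0:
                self._cond.wait()
                if self._count == 0:      # woken for nothing
                    self.false_wakeups += 1
            self._count -= 1
            self._waiting -= 1

    def release(self, n, pulse_all):
        with self._cond:
            self._count += n
            if pulse_all:
                self._cond.notify_all()   # like Monitor.PulseAll
            else:
                self._cond.notify(n)      # like n Monitor.Pulse calls

    def waiting(self):
        with self._cond:
            return self._waiting

def start_waiters(sem, n):
    ts = [threading.Thread(target=sem.wait) for _ in range(n)]
    for t in ts:
        t.start()
    while sem.waiting() < n:              # wait until all are asleep
        time.sleep(0.001)
    return ts

# PulseAll-style: one permit, eight sleepers -> seven pointless wakeups.
sem = CountingSemaphore()
ts = start_waiters(sem, 8)
sem.release(1, pulse_all=True)
while sem.false_wakeups < 7:              # the other seven must re-sleep
    time.sleep(0.001)
sem.release(7, pulse_all=False)           # let the rest finish
for t in ts:
    t.join()
print(sem.false_wakeups)                  # 7

# Pulse-per-permit: wake exactly as many threads as there are permits.
sem2 = CountingSemaphore()
ts2 = start_waiters(sem2, 8)
sem2.release(8, pulse_all=False)
for t in ts2:
    t.join()
print(sem2.false_wakeups)                 # 0
```

With one permit and eight sleepers, notify_all forces seven threads to wake, re-check, and go back to sleep; waking only as many threads as there are permits avoids that entirely.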
|
@ikopylov, thank you very much for your contribution. In general, it sounds like you're doing a great job of thinking about and testing a wide variety of configurations. This is great. When it comes to spinning and locking changes, we've generally spent a great deal of time building up confidence in the changes. So, while I don't want to discourage you from making changes to help improve things in this space, I do want you to be aware that the bar for changes here will be very high, and that you'll likely get lots of questions and/or requests to test various configs. |
|
@ikopylov, thanks for looking into this. What you have implemented is effectively a "ticket lock" (so named because of its resemblance to the "now serving" tickets you might find at a bakery, or the DMV). This has the nice property of reducing memory contention while threads are waiting, as you intend. However, it also has the possibly unintended consequence of making this a "fair" lock, in that requests are serviced in the order in which they were received. This can lead to a number of performance issues, including so-called "lock convoys" as well as "thrashing" of the CPU caches: the algorithm prefers to give control to threads that have been waiting the longest, and therefore are least likely to have their data "hot" in the CPU cache. These effects tend to be magnified as more threads contend for the lock, so an algorithm that works very well for low concurrency levels (one or two threads per core) may scale very badly as more threads are added. I don't know for sure that either of these is the cause of the performance issues you have observed, but perhaps this provides some clues. I agree that Monitor.Pulse is a likely culprit in the existing implementation, and it may be worthwhile to look for alternatives. I'll also echo Brian's statement that the bar for changes in this area is very high. Synchronization primitives need to perform reasonably well across a vast number of machine configurations and usage scenarios, and it is simply not practical to test all of the scenarios and configurations that may be important to users, so we are very careful about accepting changes to synchronization code that has been working "well enough" so far. |
|
Thanks for your response. I understand that this is a critical part of the Framework and all changes should be tested carefully in many different environments. But the performance drop here is enormous (up to 250x slower than expected), so I believe it should be fixed as soon as possible. I pointed out the problem, and if I fail to fix it, I hope the CoreCLR team will. Meanwhile, I've tested the hypothesis that the performance drop is connected to the Monitor.PulseAll call. I've added a counter to estimate the number of false wakeups, and here are the results: Now I'm pretty sure that this is the root of the problem. My initial "ticket-lock" approach just hides it in some configurations. Probably the easiest way to fix the issue is to call Monitor.Pulse only as many times as there are waiters:

if (currentCount == 1 || waitCount == 1)
{
    Monitor.Pulse(m_lockObj);
}
else if (waitCount > 1)
{
    Monitor.PulseAll(m_lockObj);
}

Or do something like this:

for (int i = 0; i < Math.Min(releaseCount, waitCount); i++)
{
    Monitor.Pulse(m_lockObj);
}

A quick test shows that this change improves performance and reduces the number of false wakeups, but the number of Monitor.Pulse calls should be evaluated a little more carefully. |
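The release-side shape proposed above can be sketched as follows (Python, with threading.Condition standing in for Monitor; the pulses counter and the manually seeded wait count are assumptions of this illustration, not part of the proposed patch):

```python
import threading

class Sem:
    """Release side only: pulse at most min(releaseCount, waitCount)
    times, never PulseAll. The pulses counter and the manually seeded
    _wait_count below exist only for this illustration."""

    def __init__(self):
        self._lock = threading.Condition()
        self._count = 0       # stand-in for m_currentCount
        self._wait_count = 0  # stand-in for m_waitCount (kept by waiters)
        self.pulses = 0

    def release(self, release_count):
        with self._lock:
            self._count += release_count
            to_notify = min(release_count, self._wait_count)
            self._lock.notify(to_notify)  # n pulses instead of notify_all
            self.pulses += to_notify
            return to_notify

s = Sem()
s._wait_count = 3        # pretend three threads are blocked in Wait
print(s.release(1))      # 1: one permit, three waiters -> one pulse
print(s.release(5))      # 3: five permits, only three waiters
```

The point of the min() is visible in both calls: the number of pulses is never larger than the permits granted, and never larger than the number of threads that could possibly consume them.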
|
I have updated the Pull Request according to my previous comment. In this modification, Monitor.Pulse is called releaseCount times (bounded by the number of waiters) instead of calling Monitor.PulseAll. This approach reduces the number of false wakeups in corner cases. Here are the results of the test (8-core CPU, Intel i7-4770k). Modified: Still not the expected results ("1, 16" should be no more than 2 times slower than "1, 8"), but much better than the original. |
|
Test on 24 core CPU (2x Intel Xeon X5675). Modified: Looks great. |
|
The change looks simple, which is great. My concern is that we need an analysis that tells us we won't have deadlock bugs because we did not release enough threads. Can you also describe the scenario that motivated the change (that is, what is your scenario that has few Releasers but many waiters)? |
Yes, sure. Here are some considerations. All critical code is executed inside the critical section, so the only possible "race" may occur between threads entering that critical section: a newcomer waiter can enter the critical section before a notified thread. But this is not a problem: the first thread just gets the ticket, and the second checks the condition and returns to the waiting state (a false wakeup).
The standard Producer-Consumer scenario. Personally, I detected this performance issue while doing some tests on BlockingCollection. It worked surprisingly slowly with a single Adder and many Takers, and the situation got worse on machines with a large number of cores. |
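For reference, the workload shape in question (one Adder, many Takers over a shared queue, each item announced through a semaphore) looks roughly like this in Python; the standard threading.Semaphore stands in for SemaphoreSlim, and all names here are illustrative, not BlockingCollection's actual internals:

```python
import threading
from collections import deque

def run(consumers=8, items=800):
    """One Adder, many Takers: each append is announced by one semaphore
    release, and takers race to consume. None is a per-taker stop signal."""
    q = deque()
    sem = threading.Semaphore(0)   # counts items available to take
    lock = threading.Lock()
    taken = []

    def adder():
        for i in range(items):
            with lock:
                q.append(i)
            sem.release()
        for _ in range(consumers):         # one sentinel per taker
            with lock:
                q.append(None)
            sem.release()

    def taker():
        while True:
            sem.acquire()
            with lock:
                item = q.popleft()
                if item is None:
                    return
                taken.append(item)

    threads = [threading.Thread(target=adder)]
    threads += [threading.Thread(target=taker) for _ in range(consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return taken

result = run()
print(len(result))   # 800 -- every item consumed exactly once
```

With one producer and eight consumers, every release wakes contention among all idle takers, which is exactly the single-Adder/many-Takers pattern where the PulseAll behavior hurts most.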
|
@ikopylov, my main concern is whether the variables being used (releaseCount and waitCount) are truly what their names say they are under all conditions. If they were underestimates (because of a race), we could introduce deadlock behavior. I am looking for a rationale / careful review that gives us confidence this is true. |
|
Dear @vancem. Below I try to provide an explanation of the code working with releaseCount and waitCount. The releaseCount variable is just passed as the argument to the Release method:

public int Release(int releaseCount) { /*....*/ }

It is more interesting how the m_currentCount variable is updated based on the releaseCount value:

lock (m_lockObj)
{
    int currentCount = m_currentCount;
    // ...
    currentCount += releaseCount;

    // Signal to any synchronous waiters
    int waitCount = m_waitCount;
    int waitersToNotify = Math.Min(releaseCount, waitCount);
    for (int i = 0; i < waitersToNotify; i++)
    {
        Monitor.Pulse(m_lockObj);
    }
    //...
    m_currentCount = currentCount;
    //...
}

The m_currentCount variable is volatile and is updated only after the Monitor.Pulse calls, so the number of Monitor.Pulse calls is sufficient to process the newly released permits. If we get some unexpected exception (e.g. ThreadAbortException), the logic will not be broken, because the notification happens before the update of m_currentCount. The next important variable is m_waitCount (waitCount is just a local copy of m_waitCount):

bool lockTaken = false;
//.....
try
{
    //.....
    // entering the lock and incrementing waiters must not suffer a thread-abort, else we cannot
    // clean up m_waitCount correctly, which may lead to deadlock due to non-woken waiters.
    try { }
    finally
    {
        Monitor.Enter(m_lockObj, ref lockTaken);
        if (lockTaken)
        {
            m_waitCount++;
        }
    }
    //.....
}
finally
{
    // Release the lock
    if (lockTaken)
    {
        m_waitCount--;
        Monitor.Exit(m_lockObj);
    }
}

One can see that this variable is incremented immediately after entering the critical section, inside a finally block, and it is decremented right before exiting the critical section, inside another finally block. The logic here is solid and can't be broken by any unexpected exception. |
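The invariant being argued here -- every increment of the waiter count is paired with a decrement on every exit path -- can be sketched in Python with a try/finally (Python has no thread-abort, so only the balanced-counter shape carries over; the names are illustrative stand-ins for the SemaphoreSlim fields):

```python
import threading

cond = threading.Condition()
wait_count = 0    # stand-in for m_waitCount
count = 0         # stand-in for m_currentCount

def wait_one():
    """Every increment of wait_count is paired with a decrement on every
    exit path (normal return or exception), so the waiter count can never
    drift and under-notify later releasers."""
    global wait_count, count
    with cond:
        wait_count += 1
        try:
            while count == 0:
                cond.wait()
            count -= 1
        finally:
            wait_count -= 1   # runs no matter how the block exits

count = 1        # one permit available: wait_one() returns immediately
wait_one()
print(count, wait_count)   # 0 0
```

Because wait_count is only ever read or written while the lock is held, a releaser that reads it inside its own locked region can never see a stale in-between value, which is the property the deadlock question hinges on.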
|
@ikopylov, Thanks for signing the contribution license agreement so quickly! Actual humans will now validate the agreement and then evaluate the PR. |
|
@ikopylov Sorry for the stupid question, but how did you manage to run those tests against your custom build? I have some ideas on how it can be improved but just cannot isolate that. I feel miserable and irritated 😞 |
|
@alexandrnikitin, I have created a simple Console Application project and ran these tests. To test the effect of my modifications, I've copied the source code of the SemaphoreSlim to the project and applied the changes to that copy. |
|
@ikopylov I tried that one too, but I failed 😭 |
|
@alexandrnikitin, All the internal dependencies are needed for the asynchronous waiting. If you do not intend to improve async scenarios, then you can easily comment them out. |
|
@ikopylov But that isn't quite fair: you changed the memory layout of the class, which actually matters in our case. And there are still some methods like the following that I have to comment out. In that setup, it already gave 10% 😄 |
|
@alexandrnikitin You can use |
|
@ikopylov

int waitersToNotify = Math.Min(releaseCount, waitCount);
for (int i = 0; i < waitersToNotify; i++)
{
    Monitor.Pulse(m_lockObj);
}

In your tests |
|
I created a repo with benchmarks in case anyone needs it https://github.com/alexandrnikitin/dotnet-coreclr-semaphoreslim-performance |
Actually, the difference is not significant in multithreaded scenarios. Performance problem is not related to
Waiting threads are waiting until
The semaphore would not work at all in that case, since all the waiting threads would never be released. My update is related to the false-wakeup problem. The problem occurs when the |
|
Then, probably, you meant I think the idea of The idea of spins and

SpinWait spin = new SpinWait();
while (m_currentCount == 0 && !spin.NextSpinWillYield)
{
    spin.SpinOnce();
}
try { }
finally
{
    Monitor.Enter(m_lockObj, ref lockTaken);
    if (lockTaken)
    {
        m_waitCount++;
    }
}

If you have a lot of waiters then they spend most of the time in the wait state here: In my opinion the "problem" here is contention for |
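The spin-then-block pattern under discussion can be sketched in Python; a fixed poll limit stands in for SpinWait.NextSpinWillYield, which has no direct Python equivalent, and all names are illustrative:

```python
import threading

class SpinThenBlockSemaphore:
    """Spin briefly hoping a permit shows up, then fall back to the
    blocking path. SPIN_LIMIT plays the role of SpinWait.NextSpinWillYield:
    after a fixed number of polls the thread stops spinning and blocks."""
    SPIN_LIMIT = 100

    def __init__(self, initial=0):
        self._cond = threading.Condition()
        self._count = initial

    def wait(self):
        for _ in range(self.SPIN_LIMIT):   # phase 1: lock-free spinning
            if self._count > 0:            # racy read; verified under lock
                break
        with self._cond:                   # phase 2: blocking path
            while self._count == 0:
                self._cond.wait()
            self._count -= 1

    def release(self, n=1):
        with self._cond:
            self._count += n
            self._cond.notify(n)

sem = SpinThenBlockSemaphore(initial=2)
sem.wait()    # fast path: the spin sees a permit, no sleeping needed
sem.wait()
print(sem._count)   # 0
```

The spin phase only reads the counter without taking the lock, so when permits are scarce the waiters end up past the spin and inside the blocking wait, which is the situation both sides of this exchange are describing.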
I didn't say anything about But indeed,
So why do you think that it is better to wake up all the threads, even if we know that only one of them will exit the lock and continue execution?
The idea of spins here is to wait for the
To be more precise, here. And that is the normal behavior. They wait until
The contention for |
|
@ikopylov Thank you for your answer.
I'm sorry, I wasn't clear enough. I meant the code you changed:

int waitersToNotify = Math.Min(releaseCount, waitCount);
for (int i = 0; i < waitersToNotify; i++)
{
    Monitor.Pulse(m_lockObj);
}

Where |
Yes, Perhaps it will be easier to understand this through an analysis of the possible values of the main variables. There are 3 stable states and 1 temporary state:
You can write a micro-benchmark and you'll see that the absolute time is microscopic compared to the total execution time of the
You probably missed this line of code. |
No, I didn't. All Releases are finished at that moment and you enter the semaphore without any contention. Actually, I don't get the meaning of that line.
I didn't mean the |
Agreed! You are right. Using I just cannot find comprehensive tests to test that. |
I've just rechecked the code of the test. You are right. There is a small flaw in the test, but it doesn't affect the results much. It is probably better to measure not the total execution time, but the number of Waits per second and Releases per second.
I've looked at your PR. The additional
Yes, the global consequences from a single |
|
Some measurements of "Waits per Second" and "Releases per Second". The numbers are a little unstable, but the win of the fixed semaphore is obvious in the "1, 16" setting. Measurement taken on an i7-4770k (8 cores: 4 physical + 4 HT). Original Modified |
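A rough sketch of how such throughput numbers can be taken (Python, against the standard threading.Semaphore; the per-thread counters, timeout-based shutdown, and fixed measurement window are assumptions of this harness, not the author's actual test code):

```python
import threading
import time

def measure(takers=4, seconds=0.2):
    """Count completed Wait and Release operations in a fixed window,
    instead of timing a fixed amount of work."""
    sem = threading.Semaphore(0)
    waits = [0] * takers       # one slot per thread: no racy shared counter
    stop = threading.Event()

    def taker(slot):
        while not stop.is_set():
            if sem.acquire(timeout=0.01):   # timeout so shutdown works
                waits[slot] += 1

    threads = [threading.Thread(target=taker, args=(i,)) for i in range(takers)]
    for t in threads:
        t.start()

    releases = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        sem.release()
        releases += 1
    stop.set()
    for t in threads:
        t.join()
    return releases, sum(waits)

releases, waits = measure()
print(releases, waits)
```

Dividing each counter by the window length gives Releases per second and Waits per second directly, and the two rates can diverge, which is exactly the information the total-execution-time measurement hides.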
|
I'm thinking about more common usage pattern of a semaphore when you wait for the semaphore, perform some actions and then release. https://github.com/alexandrnikitin/dotnet-coreclr-semaphoreslim-performance/pull/2/files And here are results: https://gist.github.com/alexandrnikitin/984bcedd7e9813b919cb |
Interesting results. But they are nothing without the answers to the following questions:
After some analysis, I found the problem in the test itself: the spinning interval is too short. If you increase that interval to 50 iterations, the difference disappears. And now the answers: The original implementation of the semaphore calls That difference is not observed in the "1-4" setting, because 3 threads have enough time to re-check the condition and return to the waiting state. I think that the behavior of the fixed semaphore is better even in this purely synthetic test. And in real-world scenarios the modified semaphore should outperform the original all the time. |
|
@gkhanna79 @vancem @brianrob , What are the next steps here? I see we have an issue opened tracking this. I'd prefer not to have PRs sit open for 18 months. We should come up with next steps and track ongoing work in issues. |
|
I have reviewed the code and the discussion above and done some experimentation, and I have convinced myself that the change is simple and safe and should have the improvements that the benchmarks demonstrated. The tests for some reason re-spun, and now three are failing, but they were succeeding before and there has been no change. Unless someone chimes in today, I will merge this change tomorrow morning. Vance |
|
@dotnet-bot Test OSX x64 Checked Build and Test |
|
There are two failures (OSX and Ubuntu) at the current time; however, these are currently failing (in the same way) for all builds. Moreover, in the past this check DID clear all tests. The change is also only going to affect SlimReaderWriterLock and is not going to be OS specific. Thus I believe it is OK to check in. |
* Waiters notification by the value of releaseCount (reduce the number of false-wakeups). Commit migrated from dotnet/coreclr@6317f26
This tweak improves performance for the case when the number of working threads is greater than the number of CPU cores.
The original code uses SpinWait before calling Monitor.Enter(), but the spinning condition is not ideal. To reduce the amount of unnecessary waiting on the Monitor, it would be good to spin until other threads have finished their work inside the critical section. The idea of this tweak is to arrange the threads on SpinWait so that they do not enter the critical section concurrently.
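The "arrange threads so they take turns" idea is essentially the ticket lock discussed earlier in the thread; here is what that take-a-number scheme looks like as a minimal Python sketch (yielding the GIL instead of a hardware pause, since Python has no PAUSE instruction; illustrative only, not the PR's actual code):

```python
import threading
import time
from itertools import count

class TicketLock:
    """Take-a-number lock: each thread draws a unique ticket and waits
    until the 'now serving' counter reaches it, so threads enter the
    critical section strictly in turn instead of racing for it."""

    def __init__(self):
        self._tickets = count()
        self._now_serving = 0

    def acquire(self):
        my_ticket = next(self._tickets)   # atomic in CPython
        while self._now_serving != my_ticket:
            time.sleep(0)                 # yield; a real spin would pause

    def release(self):
        self._now_serving += 1            # hand off to the next ticket

lock = TicketLock()
total = 0

def worker():
    global total
    for _ in range(50):
        lock.acquire()
        total += 1                        # protected: one thread at a time
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)   # 200
```

Each waiter spins on a single "now serving" value rather than all contending on the same lock acquisition at once, which is the reduced-contention property the tweak aims for; the fairness side effects of this ordering are what the reviewers warn about above.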
Measurement taken on i7-4770k (8 cores: 4 physical + 4HT).
SemaphoreSlim tests (source code of the test can be found here).
Old:
New:
For the '1, 16' case the boost is obvious. For other cases the difference is comparable to the measurement error.
BlockingCollection uses SemaphoreSlim under the covers, so I decided to test it too (source code of the test can be found here).
Old:
New:
Again, you can see a measurable boost when the number of threads is greater than the number of cores.