This repository was archived by the owner on Jan 23, 2023. It is now read-only.

SemaphoreSlim performance improvement#137

Merged
vancem merged 5 commits into dotnet:master from ikopylov:SemaphoreSlim_perf_fix on Oct 11, 2016

Conversation

@ikopylov

@ikopylov ikopylov commented Feb 7, 2015

This tweak improves performance for the case where the number of working threads is greater than the number of CPU cores.

The original code uses SpinWait before calling Monitor.Enter(), but the spinning condition is not ideal. To reduce unnecessary waiting on the Monitor, it is better to keep spinning while other threads finish their work inside the critical section. The idea of this tweak is to arrange the threads on SpinWait so that they do not try to enter the critical section concurrently.
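The arrangement described here is essentially a ticket-based spin gate. The following is a minimal, hypothetical sketch (the type and member names are illustrative, not the actual PR diff): each spinning waiter takes a ticket and spins until the previous ticket holder has passed, so the waiters do not pile up on Monitor.Enter at the same moment.

```csharp
using System.Threading;

class TicketSpinGate
{
    private int _nextTicket;   // next ticket to hand out
    private int _nowServing;   // ticket currently allowed to proceed

    public void SpinUntilMyTurn()
    {
        // Interlocked.Increment returns the incremented value, so the
        // first caller gets ticket 0, the second ticket 1, and so on.
        int myTicket = Interlocked.Increment(ref _nextTicket) - 1;
        SpinWait spin = new SpinWait();
        // Spin only while it is not our turn; give up once further
        // spinning would yield (the real code would then fall through
        // to Monitor.Enter anyway).
        while (Volatile.Read(ref _nowServing) != myTicket && !spin.NextSpinWillYield)
            spin.SpinOnce();
    }

    public void PassTurn() => Interlocked.Increment(ref _nowServing);
}
```

Note that the gate only orders the spinning phase; correctness is still guaranteed by the Monitor-protected critical section that follows.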

Measurements were taken on an i7-4770K (8 logical cores: 4 physical + 4 HT).

SemaphoreSlim tests (source code of the test can be found here).
Old:

<# of threads calling Release>, <# of threads calling Wait>: <Average execution time>
1, 1:   1585ms
8, 8:   2367ms
1, 8:   3727ms
1, 16: 97430ms
8, 1:   1522ms
16, 1:  1515ms

New:

1, 1:   1488ms
8, 8:   2402ms
1, 8:   3556ms
1, 16:  6649ms
8, 1:   1546ms
16, 1:  1489ms

For the '1, 16' case the boost is obvious. For the other cases the difference is within measurement error.

BlockingCollection uses SemaphoreSlim under the covers, so I decided to test it too (source code of the test can be found here).
Old:

<# of threads adding elements>, <# of threads taking elements>: <Average execution time>
1, 1:    2106ms
4, 4:    2111ms
8, 8:    2683ms
16, 1:  24670ms
1, 16:  22661ms
16, 16:  3147ms

New:

1, 1:    2126ms
4, 4:    1888ms
8, 8:    2176ms
16, 1:   2997ms
1, 16:   3577ms
16, 16:  2138ms

Again, you can see a measurable boost when the number of threads is greater than the number of cores.

@brianrob
Member

brianrob commented Feb 9, 2015

@ikopylov, thank you very much for your contribution.

I have not had a chance to look at your change yet. I'll take a look and reply back with comments.

Thanks.

@brianrob brianrob self-assigned this Feb 9, 2015
@ikopylov
Author

ikopylov commented Feb 9, 2015

Unfortunately, this tweak doesn't solve the problem for all configurations.

I had an opportunity to test it on a 24-core machine (2x Intel Xeon X5675), and here are the results:
Original:

1, 1:     280ms
24, 24:   558ms
1, 8:     526ms
1, 12:   4201ms
1, 24:  76981ms
4, 24:   1093ms
4, 48:  33501ms
8, 1:     184ms
12, 1:    241ms
24, 1:    248ms

With tweak:

1, 1:     202ms
24, 24:   572ms
1, 8:     455ms
1, 12:  21572ms
1, 24:  92623ms
4, 24:  11188ms
4, 48:  10140ms
8, 1:     183ms
12, 1:    251ms
24, 1:    251ms

*Because of the slow execution, I've reduced the number of iterations by a factor of 10.

As you can see, the updated version of SemaphoreSlim performs worse than the original in this environment. It is now obvious that this issue is not caused by the number of threads being greater than the number of cores. But there is still definitely a problem with the implementation of SemaphoreSlim.

I tried to wrap the Wait call in an additional lock, and it began to work as it should (OMG).
So with test code that looks like this:

while (!srcCancel.IsCancellationRequested)
{
    lock (waiterSyncObj)
        sem.Wait(myToken);
    Thread.SpinWait(takeSpin);
}

I have following results:

1, 1:   253ms
24, 24: 374ms
1, 8:   244ms
1, 12:  375ms
1, 24:  405ms
4, 24:  330ms
4, 48:  324ms
8, 1:   176ms
12, 1:  236ms
24, 1:  247ms

I now think that the problem is caused by the call to Monitor.PulseAll. It leads to the awakening of all the threads, but only a few of them exit the lock; all the others go back to the Wait state.
I'll continue the investigation, but this pull request should not be accepted in its current state.

@brianrob
Member

brianrob commented Feb 9, 2015

@ikopylov, thank you very much for your contribution.

In general, it sounds like you're doing a great job of thinking about and testing a wide variety of configurations. This is great.

When it comes to spinning and locking changes, we've generally spent a great deal of time building up confidence in the changes. So, while I don't want to discourage you from making changes to help improve things in this space, I do want you to be aware that the bar for changes here will be very high, and that you'll likely get lots of questions and/or requests to test various configs.

@ericeil

ericeil commented Feb 10, 2015

@ikopylov, thanks for looking into this. What you have implemented is effectively a "ticket lock" (so named because of its resemblance to "now serving" tickets you might find at a bakery, or the DMV). This has the nice property of reducing memory contention while threads are waiting, as you intend. However, it also has the possibly unintended consequence of making this a "fair" lock, in that requests are serviced in the order in which they were received. This can lead to a number of performance issues, including so-called "lock convoys" as well as "thrashing" of the CPU caches: the algorithm prefers to give control to threads that have been waiting the longest, and therefore are least likely to have their data "hot" in the CPU cache. These effects tend to be magnified as more threads contend for the lock, so an algorithm that works very well for low concurrency levels (one or two threads per core) may scale very badly as more threads are added. I don't know for sure that either of these are the cause of the performance issues you have observed, but perhaps this provides some clues.

I agree that Monitor.PulseAll is a likely culprit in the existing implementation, and it may be worthwhile to look for alternatives.

I'll also echo Brian's statement that the bar for changes in this area is very high. Synchronization primitives need to perform reasonably well across a vast number of machine configurations and usage scenarios, and it is simply not practical to test all of the scenarios and configurations that may be important to users, so we are very careful about accepting changes to synchronization code that has been working "well enough" so far.

@ikopylov
Author

Thanks for your response.

I understand that this is a critical part of the Framework and that all changes should be tested carefully in many different environments. But the performance drop here is enormous (up to 250x slower than expected), so I believe it should be fixed as soon as possible. I have pointed out the problem, and if I fail to fix it, I hope the CoreCLR team will.

Meanwhile, I've tested the hypothesis that the performance drop is connected with the Monitor.PulseAll call. I've added a counter to estimate the number of false wakeups, and here are the results:

1, 1:   1489ms.  False-WakeUps: 0.
1, 8:   3467ms.  False-WakeUps: 3599.
1, 16: 69707ms.  False-WakeUps: 4708503.

Now I'm pretty sure that this is the root of the problem. My initial "ticket-lock" approach just hides it in some configurations.
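The counting scheme can be reproduced with a stripped-down model (a sketch with illustrative names, not the actual test harness): a wake-up that still finds no ticket is counted as false before the thread goes back to waiting.

```csharp
using System.Threading;

// A toy semaphore that counts false wakeups: wake-ups of a waiter
// that finds the count still at 0 and must wait again.
class CountingSemaphore
{
    private readonly object _lockObj = new object();
    private int _currentCount;
    public long FalseWakeups;

    public void Wait()
    {
        lock (_lockObj)
        {
            while (_currentCount == 0)
            {
                Monitor.Wait(_lockObj);
                if (_currentCount == 0)
                    FalseWakeups++;   // woken up, but there is no ticket to take
            }
            _currentCount--;
        }
    }

    public void Release()
    {
        lock (_lockObj)
        {
            _currentCount++;
            Monitor.PulseAll(_lockObj);  // the original scheme: wake everyone
        }
    }
}
```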

Probably the easiest way to fix the issue is to call Monitor.Pulse only as many times as there are waiters.
So instead of this:

if (currentCount == 1 || waitCount == 1)
{
    Monitor.Pulse(m_lockObj);
}
else if (waitCount > 1)
{
    Monitor.PulseAll(m_lockObj);
}

Do something like this:

for (int i = 0; i < Math.Min(releaseCount, waitCount); i++)
{
    Monitor.Pulse(m_lockObj);
}

A quick test shows that this change improves the performance and reduces the number of false wakeups, but the number of Monitor.Pulse calls should be evaluated a little more carefully.

@ikopylov
Author

I have updated the pull request according to my previous comment. In this modification, Monitor.Pulse is called releaseCount times (bounded by the number of waiters) instead of calling Monitor.PulseAll. This approach reduces the number of false wakeups in corner cases.

Here are the results of the test (8-core CPU, Intel i7-4770K).
Original:

1, 1:    1491ms.   False-Wakeups:       0.
8, 8:    2293ms.   False-Wakeups:       0.
16, 16:  2390ms.   False-Wakeups:       5.
1, 8:    3443ms.   False-Wakeups:    3853. 
1, 16:  53624ms.   False-Wakeups: 3427029.
4, 24:   5915ms.   False-Wakeups:   74252.
8, 1:    1498ms.   False-Wakeups:       0.
16, 1:   1476ms.   False-Wakeups:       0.

Modified:

1, 1:    1440ms.   False-Wakeups:       0.
8, 8:    2346ms.   False-Wakeups:       0.
16, 16:  2345ms.   False-Wakeups:       2.
1, 8:    3475ms.   False-Wakeups:     503. 
1, 16:  18738ms.   False-Wakeups:   73251.
4, 24:   4607ms.   False-Wakeups:    7924.
8, 1:    1551ms.   False-Wakeups:       0.
16, 1:   1509ms.   False-Wakeups:       0.

Still not the expected results ("1, 16" should be no more than 2 times slower than "1, 8"), but much better than the original.

@ikopylov
Author

A test on the 24-core machine (2x Intel Xeon X5675).
*The number of iterations was reduced by a factor of 10.
Original:

1, 1:      203ms.   False-wakeups:        0.
24, 24:    497ms.   False-wakeups:        2.
1, 12:    5077ms.   False-wakeups:   470355.
1, 24:   77203ms.   False-wakeups:  8311677.
1, 48:  161288ms.   False-wakeups: 19293180.
4, 48:   29413ms.   False-wakeups:  1811314.
24, 1:     251ms.   False-wakeups:        0. 

Modified:

1, 1:      198ms.   False-wakeups:        0.
24, 24:    519ms.   False-wakeups:        2.
1, 12:     787ms.   False-wakeups:     2900.
1, 24:    2705ms.   False-wakeups:    13404.
1, 48:    3863ms.   False-wakeups:    23313.
4, 48:     862ms.   False-wakeups:      384.
24, 1:     254ms.   False-wakeups:        0. 

Looks great.

@vancem

vancem commented Feb 12, 2015

The change looks simple which is great. My concern is that we need an analysis that tells us that we won't have deadlock bugs because we did not release enough threads.

Can you also describe the scenario that motivated the change (that is, what is your scenario that has few releasers but many waiters)?

@ikopylov
Author

My concern is that we need an analysis that tells us that we won't have deadlock bugs because we did not release enough threads.

Yes, sure. Here are some considerations.
The number of threads to notify is calculated by this simple equation: Math.Min(releaseCount, waitCount). There's no need to Pulse more threads than the number of tickets released by the current call (releaseCount), because any thread beyond that count will not have a ticket to take inside Wait. It is also obviously pointless to Pulse more threads than are currently waiting for tickets (waitCount).

All the critical code is executed inside the critical section. So the only possible "race" may occur between threads entering that critical section: a newcomer waiter can enter the critical section before a notified thread. But this is not a problem: the first thread just takes the ticket, and the second checks the condition and returns to the waiting state (a false wakeup).
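The race described above is benign precisely because every waiter re-checks the condition in a loop around Monitor.Wait. A minimal sketch of the scheme under discussion (illustrative names, not the actual SemaphoreSlim source):

```csharp
using System;
using System.Threading;

class TinySemaphore
{
    private readonly object _lockObj = new object();
    private int _currentCount;
    private int _waitCount;

    public TinySemaphore(int initialCount) => _currentCount = initialCount;

    public void Wait()
    {
        lock (_lockObj)
        {
            _waitCount++;
            try
            {
                // Re-check after every wake-up: a newcomer may have taken
                // the ticket first, turning this wake-up into a false one.
                while (_currentCount == 0)
                    Monitor.Wait(_lockObj);
                _currentCount--;
            }
            finally { _waitCount--; }
        }
    }

    public void Release(int releaseCount)
    {
        lock (_lockObj)
        {
            // Wake only as many threads as can actually proceed.
            int waitersToNotify = Math.Min(releaseCount, _waitCount);
            for (int i = 0; i < waitersToNotify; i++)
                Monitor.Pulse(_lockObj);
            _currentCount += releaseCount;
        }
    }
}
```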

Can you also describe the scenario that motivated the change (that is, what is your scenario that has few releasers but many waiters)?

A standard producer-consumer scenario.
I think this is a relatively frequent scenario, where the number of producers is small and the number of consumers is equal to or even greater than the number of cores.

Personally, I detected this performance issue while doing some tests on BlockingCollection. It worked surprisingly slowly with a single adder and many takers. And the situation got worse on machines with a large number of cores.

@vancem

vancem commented Feb 13, 2015

@ikopylov, my main concern is whether the variables being used (releaseCount and waitCount) are truly what their names say they are under all conditions. If they were underestimates (because of a race), we could introduce deadlock behavior. I am looking for a rationale / careful review that gives us confidence this is true.

@ikopylov
Author

Dear @vancem,
First of all, my pull request doesn't affect the update logic of any variable. So if the original version of SemaphoreSlim works correctly, mine should also be correct. This is not the best argument, but it should still be taken into account.

Next, I'll try to provide an explanation of the code working with releaseCount and waitCount.

The releaseCount variable is just passed as an argument to the Release method:

public int Release(int releaseCount) { /*....*/ }

It is more interesting how the m_currentCount variable is updated based on the releaseCount value:

lock (m_lockObj)
{
    int currentCount = m_currentCount;
    // ...
    currentCount += releaseCount;

    // Signal to any synchronous waiters
    int waitCount = m_waitCount;

    int waitersToNotify = Math.Min(releaseCount, waitCount);
    for (int i = 0; i < waitersToNotify; i++)
    {
        Monitor.Pulse(m_lockObj);
    }
    //...
    m_currentCount = currentCount;
    //...
}

The m_currentCount variable is volatile and is updated only after the Monitor.Pulse calls. So the number of Monitor.Pulse calls is sufficient to process the new tickets. If we get some unexpected exception (e.g. ThreadAbortException), the logic will not be broken, because the notification happens before the update of m_currentCount.

The next important variable is m_waitCount (waitCount is just a local copy of m_waitCount).
It is updated inside the Wait method:

bool lockTaken = false;
//.....
try
{
    //.....

    // entering the lock and incrementing waiters must not suffer a thread-abort, else we cannot
    // clean up m_waitCount correctly, which may lead to deadlock due to non-woken waiters.
    try { }
    finally
    {
        Monitor.Enter(m_lockObj, ref lockTaken);
        if (lockTaken)
        {
            m_waitCount++;
        }
    }

    //.....
}
finally
{
    // Release the lock
    if (lockTaken)
    {
        m_waitCount--;
        Monitor.Exit(m_lockObj);
    }
}

One can see that this variable is incremented immediately after entering the critical section, inside a finally block, and decremented right before exiting the critical section, inside another finally block. The logic here is solid and can't be broken by any unexpected exception.

@dnfclas

dnfclas commented Mar 24, 2015

@ikopylov, Thanks for signing the contribution license agreement so quickly! Actual humans will now validate the agreement and then evaluate the PR.

Thanks, DNFBOT;

@richlander richlander added the tenet-performance Performance related issue label Mar 25, 2015
@brianrob brianrob removed their assignment May 10, 2015
@karelz karelz added this to the Future milestone Oct 28, 2015
@alexandrnikitin

@ikopylov Sorry for the stupid question, but how did you manage to run those tests against your custom build? I have some ideas on how it can be improved but just cannot isolate that. I feel miserable and irritated 😞

@ikopylov
Author

@alexandrnikitin, I created a simple Console Application project and ran these tests there. To test the effect of my modifications, I copied the source code of SemaphoreSlim into the project and applied the changes to that copy.

@alexandrnikitin

@ikopylov I tried that one too, but I failed 😭 SemaphoreSlim pulls a lot of internal dependencies with it. I tried to compile that console app against my custom-built mscorlib.dll, but it doesn't compile for some reason. I tried to write those tests inside the mscorlib project, but that didn't work either.

@ikopylov
Author

@alexandrnikitin, all the internal dependencies are needed for the asynchronous waiting. If you do not intend to improve the async scenarios, you can easily comment them out.

@alexandrnikitin

@ikopylov But that isn't quite fair: you changed the memory layout of the class, which actually matters in our case. And there are still some methods, like the following, that I have to comment out.

CancellationTokenRegistration cancellationTokenRegistration = cancellationToken.InternalRegisterWithoutEC(s_cancellationTokenCanceledEventHandler, this);
// or 
cancellationTokenRegistration.Dispose();

In that setup, it already gave a 10% improvement 😄

@ikopylov
Author

@alexandrnikitin You can use CancellationToken.Register instead. This adds a small overhead due to ExecutionContext capturing/restoring, but that is not important.

@alexandrnikitin

@ikopylov
Quick remark about your code:

int waitersToNotify = Math.Min(releaseCount, waitCount);
for (int i = 0; i < waitersToNotify; i++)
{
    Monitor.Pulse(m_lockObj);
}

In your tests releaseCount is always 1, and pulsing just once is obviously faster than pulsing all threads. In other words, you replaced PulseAll with a single Pulse. Yes, releasing the semaphore became much faster, but the "wait" procedure will take more time in that case. You could even not call Pulse at all; releasing would be super fast, but waiters would wait longer.
I don't know how to comprehensively test that yet.

@alexandrnikitin

I created a repo with benchmarks in case anyone needs it https://github.com/alexandrnikitin/dotnet-coreclr-semaphoreslim-performance

@ikopylov
Author

In your tests releaseCount is always 1, and pulsing just once is obviously faster than pulsing all threads. In other words, you replaced PulseAll with a single Pulse. Yes, releasing the semaphore became much faster...

Actually, the difference is not significant in multithreaded scenarios. The performance problem is not related to Pulse vs PulseAll execution time.

but "wait" procedure will take more time in that case.

Waiting threads wait until m_currentCount becomes positive, then decrement it and continue execution. My change does not affect m_currentCount in any way, so the "wait" procedure takes the same time as it did before.

You can even not call pulse at all, releasing will be superfast, but waiters will wait longer.

The semaphore would not work at all in that case, since the waiting threads would never be released.

My update is related to the false-wakeup problem. The problem occurs when m_currentCount has a value greater than 1 and the number of waiting threads is greater than m_currentCount. In the original implementation PulseAll is called, so all the threads move from the Wait state to the Running state, but only some of them continue execution (according to the m_currentCount value). All the other, less fortunate, threads return to the Wait state. And this unwanted transition (Wait -> Running -> Wait) eats CPU resources and leads to significant performance degradation.

@alexandrnikitin

Then, probably, you meant the currentCount variable instead of releaseCount, because, as I said, releaseCount is always 1 in your tests and you always Pulse once.

I think the idea of PulseAll is to have a "fair" lock and give all threads, not just a couple, the ability to acquire it. That's why you have a lot of false wakeups.
Here's an interesting piece of code in the internals of Monitor.Enter:
https://github.com/dotnet/coreclr/blob/release/1.0.0-rc1/src/vm/syncblk.cpp#L2862
We can acquire the lock but still must wait for fairness' sake. And yes, it's a waste of resources.

The idea of the spins and the m_currentCount check is to avoid the "expensive" wait state in Monitor.Enter():

SpinWait spin = new SpinWait();
while (m_currentCount == 0 && !spin.NextSpinWillYield)
{
    spin.SpinOnce();
}
try { }
finally
{
    Monitor.Enter(m_lockObj, ref lockTaken);
    if (lockTaken)
    {
        m_waitCount++;
    }
}

If you have a lot of waiters, they spend most of their time in the wait state here:
https://github.com/dotnet/coreclr/blob/release/1.0.0-rc1/src/mscorlib/src/System/Threading/SemaphoreSlim.cs#L386

In my opinion, the "problem" here is the contention for m_lockObj between releasers and waiters.

@ikopylov
Author

Then, probably, you meant the currentCount variable instead of releaseCount, because, as I said, releaseCount is always 1 in your tests and you always Pulse once.

I didn't say anything about releaseCount. My whole explanation was about m_currentCount.

But indeed, releaseCount is always 1 in the tests (that is the most common practical scenario). And Pulse is called once most of the time, but not always, because waitCount can be 0.

I think the idea of PulseAll is to have a "fair" lock and provide the ability to acquire it to all threads, not only to a couple. That's why you have a lot of false wakeups.

PulseAll notifies all the threads in the waiting queue that they need to wake up and re-check some condition. Pulse does the same thing, but only for the single thread at the head of the waiting queue. The order of lock acquisition is preserved in both cases.

So why do you think it is better to wake up all the threads, even when we know that only one of them will exit the lock and continue execution?

The idea of spins and m_currentCount check is to avoid "expensive" wait state in Monitor.Enter()

The idea of the spins here is to wait for the Release call in order to avoid Monitor.Wait(). Monitor.Enter() is usually cheap.

If you have a lot of waiters then they spend most of the time in wait state here: https://github.com/dotnet/coreclr/blob/release/1.0.0-rc1/src/mscorlib/src/System/Threading/SemaphoreSlim.cs#L386

To be more precise, here. And that is normal behavior. They wait until m_currentCount becomes positive. The problem is that with PulseAll, the threads wake up every time, check the condition, and return to the Wait state again.

In my opinion the "problem" here is in contention for m_lockObj between releasers and waiters.

The contention for m_lockObj is a minor issue and cannot be avoided in any case.

@alexandrnikitin

@ikopylov Thank you for your answer.

I didn't say anything about releaseCount. All my explanation was around m_currentCount.

I'm sorry, I wasn't clear enough. I meant the code you changed:

int waitersToNotify = Math.Min(releaseCount, waitCount);
for (int i = 0; i < waitersToNotify; i++)
{
    Monitor.Pulse(m_lockObj);
}

Where releaseCount is always 1, so you Pulse just once, while currentCount could be greater than 1 and you could wake up more threads. It's obvious that pulsing once is faster than pulsing several times. Basically, your tests measure how fast you release the semaphore, but they don't take the waiters into account. You can add a counter and check the number of Wait entries, and how that number is affected by the changes in Release. An example here: alexandrnikitin/dotnet-coreclr-semaphoreslim-performance#1

@ikopylov
Author

Where releaseCount is always 1, so you Pulse just once, while currentCount could be greater than 1 and you could wake up more threads.

Yes, currentCount can be greater than releaseCount, but most of the time that means that waitCount is 0. Still, due to the non-deterministic order of lock acquisition, the following can be observed: min(currentCount, waitCount) > min(releaseCount, waitCount). This can happen only when Release acquires the lock before a waking-up thread that was already notified in a previous call to Release.

Perhaps it will be easier to understand this through an analysis of the possible values of the main variables. There are 3 stable states and 1 temporary state:

  1. currentCount > 0 && waitCount == 0 - a stable state with no waiters. Obviously, we don't need to call Pulse.
  2. currentCount == 0 && waitCount == 0 - a stable state of parity between the releasers and the waiters. Again, Pulse does not need to be called.
  3. currentCount == 0 && waitCount > 0 - a stable state, when the semaphore doesn't have enough tickets for all the waiting threads. We should call Pulse here min(releaseCount, waitCount) times.
  4. currentCount > 0 && waitCount > 0 - a temporary state that can be observed right after the release of the lock by another thread inside the Release method. This state can be reached only from state 3 or state 4. Observing this state of the variables means that Release has already been executed by another thread and min(m_currentCount, waitCount) threads have already been "pulsed". These threads just have not yet acquired the lock, but when they do, the state will change to one of the 3 possible stable states. So, in this case we should call Pulse according to the value of releaseCount (not the value of currentCount).
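The per-state Pulse counts above can be spot-checked by enumerating the Math.Min(releaseCount, waitCount) rule directly (a sketch; the helper name is made up):

```csharp
using System;

class PulseCountCheck
{
    // The PR's rule: at most one Pulse per released ticket,
    // and never more Pulses than there are waiters.
    static int WaitersToNotify(int releaseCount, int waitCount)
        => Math.Min(releaseCount, waitCount);

    static void Main()
    {
        Console.WriteLine(WaitersToNotify(1, 0));  // states 1 and 2: no waiters -> 0
        Console.WriteLine(WaitersToNotify(1, 5));  // state 3: one ticket, five waiters -> 1
        Console.WriteLine(WaitersToNotify(3, 5));  // state 3: three tickets -> 3
        Console.WriteLine(WaitersToNotify(5, 2));  // more tickets than waiters -> 2
    }
}
```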

It's obvious that pulsing once is faster than pulsing several times.

You can write a micro-benchmark and you'll see that the absolute time is microscopic compared to the total execution time of the Release method.

Basically, your tests measure how fast you release the semaphore, but they don't take the waiters into account. You can add a counter and check the number of Wait entries, and how that number is affected by the changes in Release.

You probably missed this line of code.

@alexandrnikitin

Basically, your tests measure how fast you release the semaphore, but they don't take the waiters into account. You can add a counter and check the number of Wait entries, and how that number is affected by the changes in Release.

You probably missed this line of code.

No, I didn't. All the Releases are finished at that moment, and you enter the semaphore without any contention. Actually, I don't get the meaning of that line.
Take a look at the number of entries in the PR. With your changes the number is 20-30% lower.

It's obvious that pulsing once is faster than pulsing several times.

You can write a micro-benchmark and you'll see that the absolute time is microscopic compared to the total execution time of the Release method.

I didn't mean the Pulse and PulseAll methods in a vacuum, but the consequences they trigger. The methods themselves are pretty straightforward. https://github.com/dotnet/coreclr/blob/release/1.0.0-rc1/src/vm/syncblk.cpp#L3573

@alexandrnikitin

Observing this state of the variables means that Release has already been executed by another thread and min(m_currentCount, waitCount) threads have already been "pulsed". These threads just have not yet acquired the lock, but when they do, the state will change to one of the 3 possible stable states. So, in this case we should call Pulse according to the value of releaseCount (not the value of currentCount).

Agreed! You are right. Using m_currentCount doesn't make sense.

I just cannot come up with comprehensive tests for that yet.

@ikopylov
Author

Basically, your tests measure how fast you release the semaphore, but they don't take the waiters into account. You can add a counter and check the number of Wait entries, and how that number is affected by the changes in Release.

You probably missed this line of code.

No, I didn't. All the Releases are finished at that moment, and you enter the semaphore without any contention. Actually, I don't get the meaning of that line.

I've just rechecked the code of the test. You are right. There is a small flaw in the test, but it doesn't affect the results much. It is probably better to measure not the total execution time, but the number of Waits per second and Releases per second.

Take a look at the number of entries in the PR. With your changes the number is 20-30% lower.

I've looked at your PR. The additional Interlocked.Increment() operation that you introduced became a bottleneck. The right way to measure the number of Wait calls is to use a local variable in every thread and sum all the values at the end of the test.
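That measurement pattern can be sketched as follows (the thread and iteration counts are illustrative): each worker counts into a purely local variable and publishes it to its own array slot once, after the loop, so no shared Interlocked counter sits on the hot path.

```csharp
using System;
using System.Threading;

class PerThreadCounters
{
    static long CountWithLocalCounters(int threads, int opsPerThread)
    {
        long[] localCounts = new long[threads];   // one slot per thread
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            int id = t;  // capture a stable copy for the closure
            workers[t] = new Thread(() =>
            {
                long count = 0;                   // thread-local counter
                for (int i = 0; i < opsPerThread; i++)
                    count++;                      // stands in for "one Wait call"
                localCounts[id] = count;          // publish exactly once
            });
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        long total = 0;
        foreach (long c in localCounts) total += c;
        return total;
    }

    static void Main() =>
        Console.WriteLine(CountWithLocalCounters(4, 100_000));  // prints 400000
}
```

In a real benchmark the slots would also be padded to separate cache lines, so that the final writes do not introduce false sharing of their own.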

I didn't mean the Pulse and PulseAll methods in vacuum but the consequences they trigger.

Yes, the global consequences of a single Pulse are obviously smaller than those of PulseAll.
And the main idea of this PR is that we should Pulse the minimum number of threads required for the correct operation of the semaphore.

@ikopylov
Author

Some measurements of "Waits per second" and "Releases per second". The numbers are a little unstable, but the win of the fixed semaphore is obvious in the "1, 16" setting.

Measurements were taken on an i7-4770K (8 logical cores: 4 physical + 4 HT).

Original

4, 4:     wait/sec: 2710372.     release/sec:  4940706.
1, 8:     wait/sec: 2925308.     release/sec:  2935367.
1, 16:    wait/sec:  124032.     release/sec:   124032.
16, 1:    wait/sec:  185997.     release/sec: 10468515.

Modified

4, 4:     wait/sec: 2612871.     release/sec:  4936816.
1, 8:     wait/sec: 2957725.     release/sec:  2972438.
1, 16:    wait/sec:  233577.     release/sec:   234073.
16, 1:    wait/sec:  198926.     release/sec: 10776710.

@alexandrnikitin

I'm thinking about a more common usage pattern for a semaphore, where you wait on the semaphore, perform some actions, and then release. https://github.com/alexandrnikitin/dotnet-coreclr-semaphoreslim-performance/pull/2/files

And here are the results: https://gist.github.com/alexandrnikitin/984bcedd7e9813b919cb
The 1-8 case is ~2x worse.

@ikopylov
Author

And here are the results: https://gist.github.com/alexandrnikitin/984bcedd7e9813b919cb
The 1-8 case is ~2x worse.

Interesting results. But they mean nothing without answers to the following questions:

  1. Why does the performance degradation happen at all?
  2. Why does the performance degradation happen only in this particular setting (1-8)?

After some analysis, I found the problem in the test itself: the spinning interval is too short. If you increase that interval to 50 iterations, the difference disappears.
Thread.SpinWait(10) is basically equivalent to one call of an empty method, and such a small amount of work will never be found in any real-world scenario.

And now the answers

The original implementation of the semaphore calls PulseAll on every Release. So all the waiting threads wake up and acquire the lock one after another. The first of them leaves the Wait method and continues execution. Normally, all the other threads should re-check the condition and return to the Wait state. But due to the small spinning interval, the first thread calls Release earlier. So some thread that was "pulsed" in the previous Release now observes the newly appeared ticket and exits the lock section.
In other words, there are always several threads in the Running state in the original implementation. They eat CPU resources, but that does not affect the total time due to the synthetic nature of the test (there is no background work).

That difference is not observed in the "1-4" setting, because 3 threads have enough time to re-check the condition and return to the Waiting state.
The difference is not observed in the "1-16" setting due to the contention between threads for the CPU.

I think that the behavior of the fixed semaphore is better even in this purely synthetic test. And in real-world scenarios the modified semaphore should outperform the original all the time.

@Petermarcu
Member

@gkhanna79 @vancem @brianrob, what are the next steps here? I see we have an issue open tracking this. I'd prefer not to have PRs sit open for 18 months. We should come up with next steps and track ongoing work in issues.

@vancem

vancem commented Oct 10, 2016

I have reviewed the code and the discussion above, done some experimentation, and convinced myself that the change is simple and safe and should deliver the improvements that the benchmarks demonstrated.

The tests for some reason re-spun, and now three are failing, but they were succeeding before and there has been no change.

Unless someone chimes in today, I will merge this change tomorrow morning.

@gkhanna79 @vancem @brianrob

Vance

@gkhanna79
Member

I am fine with merging this if CI is green, @vancem.

CC @kouvel

@vancem

vancem commented Oct 11, 2016

@dotnet-bot Test OSX x64 Checked Build and Test
@dotnet-bot Test Ubuntu x64 Checked Build and Test

@vancem vancem merged commit 6317f26 into dotnet:master Oct 11, 2016
@vancem

vancem commented Oct 11, 2016

There are two failures (OSX and Ubuntu) at the current time; however, these are currently failing (in the same way) for all builds. Moreover, in the past this change DID clear all tests. The change is also only going to affect SemaphoreSlim and is not going to be OS specific. Thus I believe it is OK to check in.

sergign60 pushed a commit to sergign60/coreclr that referenced this pull request Nov 14, 2016
* Waiters notification by the value of releaseCount (reduce the number of false-wakeups).
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Waiters notification by the value of releaseCount (reduce the number of false-wakeups).


Commit migrated from dotnet/coreclr@6317f26