threading: lock-free fast path for SemaphoreSlim.WaitAsync #125452
thomhurst wants to merge 4 commits into dotnet:main from
Conversation
Tagging subscribers to this area: @agocke, @VSadov
@EgorBot -intel -amd -arm

using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
[MemoryDiagnoser]
public class SemaphoreSlimUncontended
{
    private SemaphoreSlim _sem = new SemaphoreSlim(1, 1);

    [Benchmark]
    public async Task WaitAsync_Release()
    {
        await _sem.WaitAsync();
        _sem.Release();
    }
}
Pull request overview
This PR introduces a lock-free fast path in SemaphoreSlim.WaitAsync that attempts to acquire an available permit via CAS, avoiding taking m_lockObjAndDisposed when uncontended.
Changes:
- Added a CAS-based fast path to decrement m_currentCount when a permit appears immediately available.
- Added special-case handling to keep AvailableWaitHandle state consistent if it's initialized concurrently during the fast-path acquire.
src/libraries/System.Private.CoreLib/src/System/Threading/SemaphoreSlim.cs
Doesn't lock itself have fast paths for that?
Use Interlocked.Add to apply a relative delta to m_currentCount rather than writing back an absolute snapshot-derived value, so concurrent lock-free decrements from the WaitAsync fast path are not overwritten.
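The suggestion above can be sketched as follows. This is a minimal illustration of the two update styles, not SemaphoreSlim's actual Release code; only the m_currentCount field name comes from the diff, the class and method names are hypothetical:

```csharp
using System.Threading;

// Minimal sketch contrasting snapshot write-back with a relative delta.
class CountUpdateSketch
{
    private int m_currentCount = 5;

    // Snapshot write-back: a lock-free CAS decrement from the WaitAsync fast
    // path that lands between the read and the write below is silently
    // overwritten (a classic lost update).
    public void ReleaseSnapshot(int releaseCount)
    {
        int snapshot = m_currentCount;
        m_currentCount = snapshot + releaseCount;
    }

    // Relative delta: Interlocked.Add composes atomically with concurrent CAS
    // decrements, so a fast-path acquire is never lost.
    public void ReleaseDelta(int releaseCount)
    {
        Interlocked.Add(ref m_currentCount, releaseCount);
    }
}
```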
if (current > 0
    && Volatile.Read(ref m_asyncHead) is null
    && Volatile.Read(ref m_waitCount) == 0
    && Interlocked.CompareExchange(ref m_currentCount, current - 1, current) == current)
The fast-path gating reads m_asyncHead and m_waitCount outside the lock, but both fields are updated under the lock without volatile/Volatile.Write semantics. These checks can legally observe stale values, letting the CAS decrement run while a Release/Wait* locked path is concurrently releasing waiters / updating counts, which can corrupt accounting (e.g., negative m_currentCount or too many waiters completed). Please make the “no waiters” signal reliable (e.g., make m_waitCount and m_asyncHead volatile and ensure all writes are volatile, or add a dedicated volatile/Interlocked state flag used by the fast path), and consider making in-lock decrements of m_currentCount CAS-based so they can’t be invalidated by a concurrent lock-free acquire.
m_asyncHead doesn't need volatile writes. When m_asyncHead is non-null, the invariant guarantees m_currentCount == 0 — so the current > 0 guard independently blocks the CAS. A stale non-null read is conservatively safe (just an unnecessary slow-path trip). A stale null read when actually non-null can't cause a double-grant because m_currentCount would also be 0.
m_waitCount stale reads are a real concern, but the root fix isn't volatile writes — it's making the in-lock check-then-decrement atomic. Even with Volatile.Write on m_waitCount, a fast-path thread that read m_waitCount = 0 before WaitCore's write can still have its CAS execute between WaitCore's line 404 check and the decrement. Replaced both in-lock plain decrements (WaitCore and WaitAsyncCore slow path) with the same CAS-loop pattern used in the fast path. Both threads' operations now serialize on m_currentCount atomically — only one wins, eliminating the double-grant.
Replace plain --m_currentCount with a CAS loop to prevent a double grant when the lock-free WaitAsync fast path decrements m_currentCount between the > 0 check and the decrement in the slow path. WaitCore is safe because m_waitCount++ on lock entry blocks the CAS guard for its entire critical section. WaitAsyncCore has no such protection.
Apply the same CAS-loop pattern to WaitCore's m_currentCount decrement that was applied to WaitAsyncCore in the previous commit. A fast-path thread that read m_waitCount = 0 before WaitCore's m_waitCount++ can still race with WaitCore's check-at-404 / decrement-at-407 sequence. The CAS loop serializes both operations on m_currentCount atomically.
@EgorBot -intel -amd -arm

using System.Threading;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
[MemoryDiagnoser]
public class SemaphoreSlimUncontended
{
    private SemaphoreSlim _sem = new SemaphoreSlim(1, 1);

    [Benchmark]
    public async Task WaitAsync_Release()
    {
        await _sem.WaitAsync();
        _sem.Release();
    }
}
  // that synchronous waiter succeeds so that they have a chance to release.
  Debug.Assert(!waitSuccessful || m_currentCount > 0,
      "If the wait was successful, there should be count available.");
- if (m_currentCount > 0)
+ // Use CAS rather than a plain decrement: the lock-free fast path in WaitAsync
+ // can decrement m_currentCount concurrently (it holds no lock).
+ int currentCount = m_currentCount;
+ while (currentCount > 0 && Interlocked.CompareExchange(ref m_currentCount, currentCount - 1, currentCount) != currentCount)
+     currentCount = m_currentCount;
The Debug.Assert that assumes waitSuccessful implies m_currentCount > 0 is no longer valid now that WaitAsync can decrement m_currentCount concurrently without taking the lock. In Debug builds this can fire spuriously (e.g., WaitUntilCountOrTimeout observed a positive count, but the lock-free fast path acquired the last permit before the assertion executes). Consider removing this assert or rewriting it to assert something that remains true under concurrent lock-free decrements (e.g., assert only after the CAS-based acquire attempt succeeds).
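One way to realize the "assert only on the CAS-based acquire" suggestion, sketched here against a hypothetical standalone type; the m_currentCount name comes from the diff, everything else is an assumption about the surrounding code:

```csharp
using System.Diagnostics;
using System.Threading;

class AssertAfterCasSketch
{
    private int m_currentCount = 1;

    public bool TryAcquire()
    {
        int currentCount = m_currentCount;
        while (currentCount > 0 &&
               Interlocked.CompareExchange(ref m_currentCount, currentCount - 1, currentCount) != currentCount)
        {
            currentCount = m_currentCount;
        }

        // The old assert (waitSuccessful implies m_currentCount > 0) can fire
        // spuriously once lock-free decrements exist. Asserting the
        // non-negativity invariant instead stays valid under concurrency:
        // the guarded CAS never takes the count below zero.
        Debug.Assert(Volatile.Read(ref m_currentCount) >= 0,
            "m_currentCount must never go negative.");

        return currentCount > 0;
    }
}
```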
// Fast path: try a lock-free acquire; falls through to the lock if it fails.
// Skipped when m_waitHandle is non-null to keep its state consistent under the lock.
if (m_waitHandle is null)
{
    int current = m_currentCount;

    // The waiter checks are best-effort: a sync waiter incrementing m_waitCount inside
    // the lock may not be visible yet, but the CAS will fail if the count has changed.
    if (current > 0
        && Volatile.Read(ref m_asyncHead) is null
        && Volatile.Read(ref m_waitCount) == 0
        && Interlocked.CompareExchange(ref m_currentCount, current - 1, current) == current)
    {
        // Handle the rare race where AvailableWaitHandle was initialised concurrently.
        if (current == 1 && m_waitHandle is not null)
        {
            lock (m_lockObjAndDisposed)
            {
                if (m_waitHandle is not null && m_currentCount == 0)
                    m_waitHandle.Reset();
            }
        }

        return Task.FromResult(true);
This change introduces a lock-free acquire path with special handling for the race where AvailableWaitHandle is initialized concurrently (the current == 1 && m_waitHandle is not null branch). There doesn't appear to be existing test coverage around SemaphoreSlim.AvailableWaitHandle state transitions under concurrent WaitAsync/initialization; adding a stress/regression test for this scenario would help prevent subtle signaling regressions (e.g., wait handle remaining signaled when the count reaches 0 via the fast path).
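A stress test along the lines this comment asks for might look like the sketch below; the iteration count, type name, and the exact invariant checked are assumptions, not part of the PR:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class AvailableWaitHandleStress
{
    public static void Run()
    {
        for (int i = 0; i < 1_000; i++)
        {
            var sem = new SemaphoreSlim(1, 1);
            using var start = new ManualResetEventSlim();

            // One task races the WaitAsync fast-path acquire...
            Task<bool> waiter = Task.Run(() =>
            {
                start.Wait();
                return sem.WaitAsync(0).Result; // may take the new lock-free path
            });

            // ...against a task that lazily initializes AvailableWaitHandle.
            WaitHandle handle = null;
            Task init = Task.Run(() =>
            {
                start.Wait();
                handle = sem.AvailableWaitHandle;
            });

            start.Set();
            Task.WaitAll(waiter, init);

            // Invariant: once the count reaches 0, the handle must not be signaled.
            if (sem.CurrentCount == 0 && handle.WaitOne(0))
                throw new Exception($"Iteration {i}: handle signaled while count is 0");
        }
    }
}
```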
@EgorBo I think just by not entering the lock we can save some time: EgorBot/Benchmarks#31
Use a lock-free CAS fast path in SemaphoreSlim.WaitAsync to skip the Monitor lock when a permit is immediately available, improving uncontended throughput