Improve thread pool performance on Unix #6955
BenchmarkDotNet=v0.11.3, OS=macOS Mojave 10.14.2 (18C48a) [Darwin 18.2.0]
Intel Core i5-5675R CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.0.100-preview-010184
[Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
CoreRT : .NET CoreRT 1.0.27503.0, 64bit AOT
Job=CoreRT Runtime=CoreRT
/cc @VSadov
The difference on Windows (FeaturePortableThreadPool enabled) seems small, but still favorable for the change. Benchmarking it on a laptop however gives the thermal management a hard time, so it's not very precise.
BenchmarkDotNet=v0.11.3, OS=Windows 10.0.17763.253 (1809/October2018Update/Redstone5)
Intel Core i5-7267U CPU 3.10GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=3.0.100-preview-009812
[Host] : .NET Core 2.1.7 (CoreCLR 4.6.27129.04, CoreFX 4.6.27129.04), 64bit RyuJIT
CoreRT : .NET CoreRT 1.0.27503.0, 64bit AOT
Job=CoreRT Runtime=CoreRT
I tried running @benaadams' benchmark from https://github.com/benaadams/ThreadPoolTaskTesting. The results are below, but don't read too much into them, since it turns out that on my test machine the numbers vary quite wildly with each run (some of the numbers differ by 20% between two consecutive runs). I am not sure whether there is something I can do about it or if anything can be read from the numbers at all. I didn't try to force high priority for the process, which would likely make the numbers more consistent. Unfortunately both my MacBook and iMac eventually hit thermal issues under these tests (generally, any long-running tests on the thread pool) and either start very loud cooling or CPU throttling.
Pre:
Post:
kouvel left a comment:
Still reviewing, a few comments so far
class WaiterListEntry
{
    public LowLevelMonitor _monitor;
    public WaiterListEntry _next;
I thought that you may use this implementation for Mono. It shouldn't be necessary to change the implementation in CoreRT for that purpose, what it was doing before is more appropriate for CoreRT. It would probably be beneficial to separate LowLevelLifoSemaphore, which offers just the base LIFO semaphore functionality (no spinning, etc.), from ClrLifoSemaphore, which does all of the rest. Maybe ClrLifoSemaphore could use a better name but I would leave that for later, as there are more abstractions that could be done. ClrLifoSemaphore would use LowLevelLifoSemaphore. With that, the implementation of ClrLifoSemaphore can be shared between runtimes and the only runtime-specific dependency would be LowLevelLifoSemaphore, for which runtimes can provide an implementation that is suitable. For CoreRT, that would be using the wait subsystem, for Mono it may be this implementation, for CoreCLR it would be the PAL, and so on.
I didn't split up LowLevelLifoSemaphore and ClrLifoSemaphore because technically there's only one user of it - the thread pool. There's probably never going to be another one because there's SemaphoreSlim, so as long as you don't specifically depend on uninterruptibility and LIFO order (at the same time) there's no reason to use this one. LowLevelLock already contained spinning code, so it seemed acceptable that spinning is part of the low-level code. The LowLevel prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or has read the long comment in WaitSubsystem).
The runtime sharing is a bit of a PITA that I had to put some thought into. I am trying hard to avoid a Mono-specific implementation (as long as I can avoid hurting CoreRT at the same time).
While the wait subsystem seems like an obvious choice on CoreRT it doesn't really save any code. In fact it makes the subsystem support LIFO order and uninterruptible Wait which is way more code than the primitive implementation here. Also the current implementation of WaitSubsystem has contention on a single lock which hurts the performance a bit (implementation detail, but a real problem).
My current plan is to break this dependency on WaitSubsystem in this one place and move LowLevelLock/LowLevelMonitor/LowLevelSemaphore into the shared part of CoreLib as a feature (e.g. FeaturePortableLowLevelSync). That would be shared by CoreRT and Mono. CoreCLR could use it (for testing at least), or simply skip the feature and implement the primitives on its own through existing PAL code.
Merging LowLevelLifoSemaphore and ClrLifoSemaphore is fine, the wait portions can still be specialized for runtimes, and I think they should be. I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR, it's unnecessarily more complicated. The tracking being done here and additional overhead is unnecessary for those runtimes, which already do similar tracking to support all of the other types of waits and the tracking is shared so there is no additional overhead with uninterruptibility. Adding uninterruptibility in both runtimes was actually very easy. Conversely it was actually a lot more work to add interruptibility to CoreRT's wait subsystem, which started off uninterruptible during initial implementation. If it's not easy to modify Mono's semaphore to support uninterruptible waits, I would still prefer to separate out this implementation specifically for Mono.
The LowLevel prefix also makes it immediately obvious that it is uninterruptible (as long as one knows the convention or has read the long comment in WaitSubsystem).
At the moment the LowLevel part doesn't just mean uninterruptible, it also means it's a simple implementation that directly uses system components. LowLevelLock does some spinning, in hindsight it probably would have made sense to separate it out. The idea was that the LowLevel portions would be system dependencies that may be specific to runtimes (which may also be shared if possible) and anything on top of them could be shared. Even if there is only one user currently, it can help to componentize the functionality.
While the wait subsystem seems like an obvious choice on CoreRT it doesn't really save any code. In fact it makes the subsystem support LIFO order and uninterruptible Wait which is way more code than the primitive implementation here.
Not really, most of the code is shared with interruptible waits so there is no overhead that is specific to uninterruptible waits.
Also the current implementation of WaitSubsystem has contention on a single lock which hurts the performance a bit (implementation detail, but a real problem).
I don't see how this implementation would be better. Typically these locks are uncontended. It's rare for two threads to both be releasing threads because the thread pool uses an interlocked operation to figure out how many threads need to be released and one of those two threads would release all of them. Perf is not a big concern but the wait subsystem is actually pretty fast.
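The release pattern described above, where one releasing thread uses an interlocked operation to claim the whole batch of wake-ups, could be sketched in C roughly as follows; the standalone counter and the function names are illustrative for the example, not the actual thread pool code (where this state lives inside the packed Counts word):

```c
#include <stdatomic.h>

/* Illustrative standalone counter of threads pending release. */
static _Atomic int g_threads_to_release;

/* A releaser records how many threads need to be woken. */
static void request_release(int count)
{
    atomic_fetch_add(&g_threads_to_release, count);
}

/* A single thread atomically claims the entire pending batch, so even
 * if two threads release concurrently, one of them wakes all of the
 * waiters and the other claims zero and does nothing. */
static int claim_releases(void)
{
    return atomic_exchange(&g_threads_to_release, 0);
}
```

This is why the lock protecting the actual wait queue would rarely be contended: only the thread that claimed a non-zero batch proceeds to the wake-up path.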
Even if there is only one user currently, it can help to componentize the functionality.
Granted, sometimes it can be advantageous not to componentize in order to add some features.
I really don't want to use this implementation of a LIFO semaphore in CoreRT or CoreCLR, it's unnecessarily more complicated.
I do have different goals for this, which makes it a hard agenda to push. I think it's not unnecessarily more complicated because the amount of code removed from the wait subsystem is more than the amount of code added here. I will do some more benchmarks of different variations, but I actually think that the contention on s_lock in WaitSubsystem hurts the performance significantly for heavily threaded code and this is an easy way out. The suggestion with CoreCLR was not about using it there for production code, but only as a means of bringing up the code in another runtime and for apples-to-apples performance comparison.
At the moment the LowLevel part doesn't just mean uninterruptible, it also means it's a simple implementation that directly uses system components.
The documentation states
/// <see cref="LowLevelLock"/> and <see cref="LowLevelMonitor"/>
/// - These are "low level" in the sense they don't depend on this wait subsystem, and any waits done are not
/// interruptible
LowLevelSemaphore never used the system components on Unix; it used the wait subsystem, violating any common sense about the component layering. In that sense calling it "low level" is wrong. The only other shared property is the uninterruptibility. I think it should be implemented using system components, but if I fail to push in that direction then it should definitely be renamed to avoid confusion.
Not really, most of the code is shared with interruptible waits so there is no overhead that is specific to uninterruptible waits.
No overhead in terms of performance for other code paths through the WaitSubsystem. The amount of code is a different story though. Performance-wise it increases contention on the single lock. That's an orthogonal issue to these code changes though.
re the naming of LowLevelSemaphore: I chose that name more so to match the naming of LowLevelLinq (subset of API surface reimplemented for runtime use) than LowLevelLock or LowLevelMonitor (thin managed wrappers around native system components).
It is also not necessary to port the wait subsystem; that is a runtime-specific piece currently. Perhaps it could be shared in the future, but it doesn't have to be at the moment. Is there some other agenda you're trying to push for?
The whole point is to share as much code as possible between the runtimes.
Code that is runtime-specific doesn't get as much testing as code that is shared. Likewise, performance improvements have historically been done mostly on the CoreCLR side and didn't translate to CoreRT, which was a shame. Now that Mono is sharing the CoreLib code, there was a push that helped to close the gap by moving more of the code to the shared part of the CoreLib.
Ideally I would like to see the wait subsystem shared between all three runtimes, but we are still quite far away from there and it may well never happen.
Decoupling individual components - thread pool, timers, low level synchronization primitives - was an attempt to get there faster, piece by piece. Unfortunately I underestimated the complexity of the LIFO semaphore implementation and its edge cases. I assumed I could do it in less than 100 lines and simplify the WaitSubsystem at the same time to the point that it would be a win-win situation overall (no regression, cleaner component separation, the benefits of code being shared and tested by 2 runtimes and allowing testing the whole managed pool on CoreCLR as well). Eventually I ended up with an implementation that is small enough and covers all the edge cases that were brought up, but it requires unmanaged code and it's not as mature as whatever WaitSubsystem does now.
This implementation adds another monitor per thread, allocates another object per thread, has issues with OOMs...
Yeah, it was flawed. I have completely rewritten it in the end, but it's out of scope of this PR. I solved a couple of the issues by moving the LIFO semaphore to unmanaged code, reducing the number of PThread objects (1 mutex + 1 condition per waiting thread) and storing the LIFO entries on the stack. (Unfinished code here: https://gist.github.com/filipnavara/4d58444435cb106bbc4e3ffa939e99be, missing comments and proper pthread_cond_t initialization on non-macOS systems.)
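The design described here (one pthread mutex plus one condition variable per waiting thread, with the LIFO entries stored on the waiters' stacks) could be sketched in C roughly as follows. This is only an illustration of the idea under those assumptions, not the gist's actual code, and it omits spinning and timeout handling:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* One node per waiting thread; it lives on the waiter's own stack. */
typedef struct waiter {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    bool signaled;
    struct waiter *next;   /* next-older waiter in the LIFO list */
} waiter_t;

typedef struct {
    pthread_mutex_t lock;  /* protects count and the waiter list */
    waiter_t *head;        /* most recent waiter; released first (LIFO) */
    int count;             /* pending signals not yet consumed */
} lifo_semaphore_t;

void lifo_semaphore_init(lifo_semaphore_t *s)
{
    pthread_mutex_init(&s->lock, NULL);
    s->head = NULL;
    s->count = 0;
}

void lifo_semaphore_wait(lifo_semaphore_t *s)
{
    pthread_mutex_lock(&s->lock);
    if (s->count > 0) {            /* a signal is pending: consume it */
        s->count--;
        pthread_mutex_unlock(&s->lock);
        return;
    }
    waiter_t w;                    /* stack-allocated LIFO entry */
    pthread_mutex_init(&w.mutex, NULL);
    pthread_cond_init(&w.cond, NULL);
    w.signaled = false;
    w.next = s->head;              /* push on top: newest wakes first */
    s->head = &w;
    pthread_mutex_unlock(&s->lock);

    /* The signaled flag (checked under w.mutex) makes this safe even
     * if a releaser signals before we reach pthread_cond_wait. */
    pthread_mutex_lock(&w.mutex);
    while (!w.signaled)
        pthread_cond_wait(&w.cond, &w.mutex);
    pthread_mutex_unlock(&w.mutex);
    pthread_cond_destroy(&w.cond);
    pthread_mutex_destroy(&w.mutex);
}

void lifo_semaphore_release(lifo_semaphore_t *s, int count)
{
    pthread_mutex_lock(&s->lock);
    while (count > 0 && s->head != NULL) {
        waiter_t *w = s->head;     /* pop the most recent waiter */
        s->head = w->next;
        count--;
        pthread_mutex_lock(&w->mutex);
        w->signaled = true;
        pthread_cond_signal(&w->cond);
        pthread_mutex_unlock(&w->mutex);
    }
    s->count += count;             /* no waiters left: remember signals */
    pthread_mutex_unlock(&s->lock);
}
```

Because each waiter owns its mutex and condition variable, a release only touches the popped entry, and the stack allocation avoids the per-thread object allocations mentioned in the quoted comment.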
The global lock is held for a very short time, I have not seen significant contention on that lock.
I have seen it in the heavy threading+locking benchmarks before, but it could have been a flaw in the benchmark methodology. I went ahead and re-ran different combinations of the code with my benchmarks. It turns out that the additional logic with the interlocked Counts loops avoids getting to the slow path in those heavy threading scenarios.
I think the name is appropriate, I'm not too particular about the name as long as it sort of works, it's difficult to get names right.
I understand it is difficult. I am not happy about the name, but I don't really have a better name right now. The "low level" part makes me feel it should not call into something "higher level" like WaitSubsystem, implementation detail or not.
re the naming of LowLevelSemaphore: I chose that name more so to match the naming of LowLevelLinq (subset of API surface reimplemented for runtime use) than LowLevelLock or LowLevelMonitor (thin managed wrappers around native system components)
Ah, thanks for the helpful historical note :-) LowLevelLinq was not really used in that part of the code anymore, aside from a using in one of the files that seems like a leftover.
A good time to share the implementation would be if system LIFO wait support were to be found. This implementation of the component can be done in managed or native code, would be fairly small, and perf doesn't really matter, so I don't see a huge benefit in sharing it at the moment, especially since there are better alternatives for some runtimes. Sharing the portion above this part (the ClrLifoSemaphore part of tracking spinners/waiters, etc.) is a lot more useful.
A good time to share the implementation would be if a system LIFO wait support were to be found.
Not going to happen, at least not in a portable way. Every other library I found (notably Facebook Folly, which is referenced from a couple of academic papers) uses an implementation similar to the one I came up with.
kouvel left a comment:
Still reviewing, hope to continue tomorrow
public static Counts CompareExchange(ref Counts location, Counts newCounts, Counts oldCounts)
{
    return new Counts { _asLong = Interlocked.CompareExchange(ref location._asLong, newCounts._asLong, oldCounts._asLong) };
It would be more concise to add a constructor that takes the long value and to use that instead of setting the _asLong field on each instantiation.
Again, copied from ClrThreadPool Counts. I would have likely done that as you suggested but I copied existing code for consistency.
Not a big deal, probably would be a bit cleaner
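For illustration, the packed-counters pattern that Counts relies on (several small counters stored in one 64-bit word, so they can be read and updated together with a single compare-exchange) could be sketched in C11 like this; the field layout and the names are made up for the example and do not match the actual struct:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Several counters packed into one 64-bit word so that all of them can
 * be updated atomically in a single compare-exchange (layout is
 * illustrative only). */
typedef union {
    uint64_t as_long;
    struct {
        uint16_t signal_count;
        uint16_t waiter_count;
        uint8_t  spinner_count;
        uint8_t  waiters_signaled_to_wake;
    };
} counts_t;

/* Try to consume one pending signal; retried in a CAS loop because a
 * concurrent update to any packed field invalidates the whole word. */
static bool try_consume_signal(_Atomic uint64_t *location)
{
    uint64_t old_val = atomic_load(location);
    for (;;) {
        counts_t old_counts = { .as_long = old_val };
        if (old_counts.signal_count == 0)
            return false;
        counts_t new_counts = old_counts;
        new_counts.signal_count--;
        /* On failure, old_val is refreshed with the latest value. */
        if (atomic_compare_exchange_weak(location, &old_val,
                                         new_counts.as_long))
            return true;
    }
}
```

The one-word representation is what makes the "constructor that takes the long value" suggestion natural: every CAS produces a new instance from a raw 64-bit value.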
public override int GetHashCode()
{
    return (int)(_asLong >> 8);
This should never be used, perhaps do Debug.Fail() and return 0. Alternatively if you want a somewhat useful value (probably not worth it), could return a sum of each of the components.
Only copied it for consistency with Counts in ClrThreadPool. It should never be used, so Debug.Fail may be in order.
Benchmark after updating the spin count default to match CoreCLR value [but not the PAL *2 multiplier] and spinning algorithm:
@filipnavara if you are looking for more ThreadPool benchmarks you should take a look at what we have in the performance repo:
@adamsitnik Thanks! I looked in the performance repository before, but I missed that one. I will definitely give it a try.
* Allow reflecting on DebuggableAttribute on CoreRT
  See dotnet/corert#6955 (comment). BenchmarkDotNet reflects on this attribute here: https://github.com/dotnet/BenchmarkDotNet/blob/a76f438a81e3e0235a2e9b1249a5280038189689/src/BenchmarkDotNet/Extensions/AssemblyExtensions.cs#L28-L38
* Comment
"…n WaitSubsystem, they are no longer used"
This reverts commit 938f134.
}
else
{
    RuntimeThread.Yield();
The processorCount <= 1 case is already covered in the initial spin index. It would be a bit cleaner to change the use of SpinYieldThreshold above to SpinSleep0Threshold (then SpinYieldThreshold would not be used anymore and can be removed), and revert this block to what it was before.
It is not covered because the Wait method now also spins on every second iteration. A better solution would be to keep the spin counter inside LowLevelSpinWaiter, but that's a refactoring I would like to do separately.
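The spin/yield/sleep escalation being discussed can be factored as a small decision function kept next to the spin counter; the sketch below is in C, and both the thresholds and the alternation scheme are illustrative rather than the actual CoreRT constants:

```c
#define SPIN_YIELD_THRESHOLD  10  /* illustrative, not the real value */
#define SPIN_SLEEP0_THRESHOLD 20  /* illustrative, not the real value */

typedef enum {
    SPIN_ACTION_SPIN,
    SPIN_ACTION_YIELD,
    SPIN_ACTION_SLEEP0
} spin_action_t;

/* Decide the back-off action for one iteration of a spin-wait loop:
 * busy-spin early (only on multi-processor machines), then yield, and
 * past the sleep threshold alternate yield and sleep(0) so the thread
 * still gives other same-priority threads a chance to run. */
static spin_action_t next_spin_action(int spin_index, int processor_count)
{
    if (processor_count > 1 && spin_index < SPIN_YIELD_THRESHOLD)
        return SPIN_ACTION_SPIN;
    if (spin_index < SPIN_SLEEP0_THRESHOLD || (spin_index & 1) != 0)
        return SPIN_ACTION_YIELD;
    return SPIN_ACTION_SLEEP0;
}
```

A caller would map SPIN_ACTION_YIELD to sched_yield() and SPIN_ACTION_SLEEP0 to a zero-length sleep; centralizing the decision like this handles the single-processor case without the caller checking the processor count.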
Perhaps Wait could check s_processorCount and sleep instead of spin, instead of checking the proc count at the caller side
Yes, that's what I would want to do... but the initialization of the static variables and reading of the processor count is kinda funky there. I wasn't sure if it's safe to change it or if the code is possibly reached from some early static constructor initialization and it would result in some dangerous dependency.
Oh ok, yea the wait subsystem and stuff it uses are used in lazy class construction. It would work to change the Initialize method into a static method (maybe rename to EnsureInitialized or something like that) and call it before any spin-wait loops that use this type. Not sure if it's the best solution. I think I had considered attributing this class with EagerStaticClassConstruction but maybe there was some issue with it, I don't remember the details.
I tried to reorganize it a little in the last commit (cffc32d). It's not perfect, but it's a bit more readable. I changed it to use PlatformHelper.ProcessorCount which is supposed to respond to change in number of processors and still do reasonable caching that works with the static constructor initialization.
Damn, looks like it broke the universe. I'll have to debug it on my other machine tomorrow... Reverting for now.
… to avoid bringing in too many dependencies
…itial signal count (#22632)
- Port of a fix from dotnet/corert#6955
- The underlying semaphore is only used to wake up waiters; initially there are no waiters, so the signal count should be zero. This was already the case on Windows; this fixes the Unix side. The actual initial signal count is tracked in the upper-layer counts.
- The initial signal count passed in is zero anyway in the places where it's used, so it makes no difference for now, just some cleanup.
There's a couple of objectives to these changes:
WaitSubsystem, to allow sharing them with Mono

List of changes:
- FirstLevelSpinWaiter is renamed to LowLevelSpinWaiter for consistency with the usage of the LowLevel prefix in the wait subsystem
- The Wait method of LowLevelSpinWaiter is exposed for reuse from LowLevelLifoSemaphore
- LowLevelLifoSemaphore is updated to more closely mimic CoreCLR's CLRLifoSemaphore spinning logic
- LowLevelLifoSemaphore.Unix.cs is reimplemented on top of LowLevel[Monitor/Lock] to remove the dependency on the complex WaitSubsystem and its contention on a single lock

I used an improvised benchmark based on code from dotnet/coreclr#13670 (comment). It is likely flawed, but it should be useful for basic comparison.
Benchmark results are included below. On macOS it is a consistent improvement over the baseline with no regression on sustained load. I will run the benchmarks on Windows too and report the results. At the moment I don't have a Linux setup to test this on.
This is part of a bigger patch series that is supposed to make the portable thread pool implementation sharable across .NET runtimes. Initially I target CoreRT/Unix and Mono, but I would like to allow building CoreCLR with it for comparison purposes as well.
/cc @kouvel @jkoritzinsky @jkotas @benaadams @marek-safar