Use LibuvThread inspired schedulers in socket transport#2368
Conversation
| { | ||
| internal sealed class SocketTransport : ITransport | ||
| { | ||
| private const int _numSchedulers = 8; |
There was a problem hiding this comment.
TODO: Determine the best way to calculate the optimal number of schedulers.
|
Nice! The thing I don't like about it is the thing you commented on. I don't want to be in the business of tuning the thread count with the sockets transport. That's one of the things we wanted to get away from generally. Can you run plain text as well to see what you get (both oses) |
|
@davidfowl The nice thing is that these schedulers aren't in the business of managing any heavy resources like Threads. The CoalescingScheduler is basically just two queues of Actions a lock. Otherwise it's the normal ThreadPool doing all the work. My still unproven theory is that just using the same calculation we use for the libuv thread count will be reasonable for all of the scenarios I left to test. If we end up merging this PR, we will probably want to make the scheduler count configurable like we do for the libuv transport. |
|
So you are queuing batches to the ThreadPool rather than individual items; this might reduce the number of threadpool threads, thus reducing contention on the global queue, and local queues? Locking is much higher for the queuing to the batch queues, than it is queuing to ThreadPool's global ConcurrentQueue (as each schedule takes a full lock). In your output can you also dump thread counts? /cc @stephentoub |
But you are still responsible for tuning thread counts. The way this is structured, you'll only ever have at most one thread processing work from one of these coalescing schedulers, so you'll need to decide exactly how many of these you want and how you want to divvy work up between them. Isn't that what David was asking about? |
|
@stephentoub I completely agree with your assessment. That's why I put the TODO in the code to dynamically calculate the count and suggested giving a way for the user to configure the number of schedulers (i.e. DoWork loops) like we do with libuv. I didn't mean to imply the number of schedulers doesn't matter. Even quick experimentation changing |
| { | ||
| queue.Dequeue()(); | ||
| } | ||
| } |
There was a problem hiding this comment.
I'm not clear on why this logic with multiple queues is being used. What are you trying to achieve with it? Why not something like:
internal sealed class SingleWorkerScheduler
{
private static readonly WaitCallback s_worker = s => ((SingleWorkerScheduler)s).ProcessItems();
private readonly ConcurrentQueue<Action> _actions = new ConcurrentQueue<Action>();
private bool _hasWorker;
public void Schedule(Action action)
{
_actions.Enqueue(action);
lock (_actions)
{
if (!_hasWorker)
{
ThreadPool.QueueUserWorkItem(s_worker, this);
_hasWorker = true;
}
}
}
private void ProcessItems()
{
try
{
while (_actions.TryDequeue(out Action item))
{
item();
}
}
finally
{
lock (_actions)
{
if (!_actions.IsEmpty)
{
ThreadPool.QueueUserWorkItem(s_worker, this);
_hasWorker = true;
}
else
{
_hasWorker = false;
}
}
}
}
}
Dynamically calculating and spinning up and down workers is a huge amount of work. That's the main thing the thread pool does. And it's way too easy to tune something like this to a particular benchmark and then find that it falls down quickly on others. |
|
I shouldn't have used the word dynamic. I meant that we should consider basing the scheduler count on the number of virtual processors like we do for libuv instead of keeping I'll measure the performance of your SingleWorkerScheduler @stephentoub. The only reason for using two non-concurrent queues is that we've already successfully used that pattern in the libuv transport. |
Ok. Which has better throughput will depend on the use case. Worst case for the two queues one is that you end up in a pattern where every work item dequeue causes you to take the lock and switch queues (and contend with enqueuers), and is thus worse than the concurrent queue. Best case for the two queues one is you end up in a real batch/burst scenario where you're able to fill a queue and swap it, and then there's no per work item synchronization when dequeueing each, just the one for the whole batch. |
Ok. I continue to worry that we're tuning all of this to artificial benchmarks, and that real workloads will suffer as a result. I could be off base, or maybe these workloads really are effectively stand-ins for what people will be doing. |
60c1a0c to
f7b7823
Compare
|
The ConcurrentQueue seems to perform similarly in the JSON benchmark on Windows while simplifying the code, so I updated the PR to use a ConcurrentQueue. Json WindowsSocketsSwapping QueuesConcurrentQueue |
f7b7823 to
45d0e42
Compare
|
You can drop the lock with ConcurrentQueue? public class CoalescingScheduler
{
private static readonly WaitCallback _doWorkCallback = s => ((CoalescingScheduler)s).DoWork();
private readonly ConcurrentQueue<Action> _actions = new ConcurrentQueue<Action>();
private int _doingWork;
public void Schedule(Action action)
{
_actions.Enqueue(action);
if (Interlocked.Exchange(ref _doingWork, 1) == 0)
{
ThreadPool.QueueUserWorkItem(_doWorkCallback, this);
}
}
private void DoWork()
{
ProcessQueue();
// Swtich off flag
Volatile.Write(ref _doingWork, 0);
// Process any items added in race
ProcessQueue();
}
private void ProcessQueue()
{
while (_actions.TryDequeue(out Action item))
{
item();
}
}
} |
|
Probably is a better way to resolve the unset race |
With that scheme, only if you're ok with there being multiple workers running for some period of time. |
Was exercise for reader... 😉 This better? public class CoalescingScheduler
{
private static readonly WaitCallback _doWorkCallback = s => ((CoalescingScheduler)s).DoWork();
private readonly ConcurrentQueue<Action> _actions = new ConcurrentQueue<Action>();
private int _doingWork;
public void Schedule(Action action)
{
_actions.Enqueue(action);
if (Interlocked.Exchange(ref _doingWork, 1) == 0)
{
ThreadPool.QueueUserWorkItem(_doWorkCallback, this);
}
}
private void DoWork()
{
while (_actions.TryDequeue(out Action item))
{
item();
}
// Swtich off flag
Volatile.Write(ref _doingWork, 0);
// Process any items added in race
while (Volatile.Read(ref _doingWork) == 0 && _actions.TryDequeue(out Action item))
{
item();
}
}
} |
Similar issue. DoWork could have just checked _doingWork and found it to be 0, then lots of enqueuers come along and queue a bunch of items, and the first will end up queueing another consumer. Now the original consumer will be dequeueing and executing at least one work item concurrently with another consumer. |
|
I like the idea Ben. I don't see there being a huge issue with having work items for a given scheduler running in parallel as long as it's rare. Unfortunately it doesn't seem to perform better than locking, and locking is a bit simpler conceptually. Json WindowsSocketslock free |
45d0e42 to
e2a16c0
Compare
|
@halter73 get some plaintext numbers. |
| { | ||
| if (!_actions.IsEmpty) | ||
| { | ||
| ThreadPool.QueueUserWorkItem(_doWorkCallback, this); |
There was a problem hiding this comment.
Is this to avoid blocking the thread pool thread for too long?
There was a problem hiding this comment.
That's the theory behind it, but I didn't test the performance of just looping again. I suspect that would perform similarly.
|
|
||
| private void DoWork() | ||
| { | ||
| while (_actions.TryDequeue(out Action item)) |
There was a problem hiding this comment.
Do we need a max number of items to dequeue so we don't starve this thread?
Agreed. If we end up needing to tweak this number for different workloads then I think we've failed. We need to run all of our benchmarks to make sure we're not missing anything glaring. @halter73 If reads do complete synchronously, it's possible for a connection to delay/starve the other work items in a queue. Before that was happening with the IO thread itself, but now we've moved that into our queue. |
Yes. But now it wouldn't be a single connection starving others, but a single scheduler which is responsible for multiple connections. And unlike connections, there's a fixed number of schedulers which will likely be kept under Environment.ProcessorCount. Still, if this is a concern, having a max iteration count like I did in an earlier version of the PR doesn't seem to effect performance much. |
e2a16c0 to
78dc643
Compare
|
Pipelined plaintext performance on Windows is similar to #2366. |
78dc643 to
1d8ec43
Compare
|
@halter73 can you get updated numbers? We got a new corefx build. |
| /// Defaults to half of <see cref="Environment.ProcessorCount" /> rounded down and clamped between 1 and 16. | ||
| /// </remarks> | ||
| //public int IOLoopCountCount { get; set; } = ProcessorThreadCount; | ||
| public int IOLoopCountCount { get; set; } = Math.Min(Environment.ProcessorCount, 16); |
There was a problem hiding this comment.
Is it possible to run queueless?
Json Windows SocketsPlaintext Windows Sockets |
cfb0250 to
dc6fddd
Compare
|
Do you have some thoughts what gets coalesced here? This could be work items which are created from the epoll thread successively. Currently, these work items do the i/o operations (e.g. read/write from the socket). This PR may cause those operations to be serialized. It would be interesting to see a benchmark when the epoll thread performed the i/o operation and then created a work item. Then read/write operations would be serialized already on the epoll thread. |
Work items queued to these schedulers run user code, right? If so, how does limiting max iterations help if a single work item blocks for a period of time? Seems like it doing so will end up starving every other connection assigned to that scheduler. Even worse, and this may be stretching things, but imagine code did: httpClient.PostAsync(uriBackToSameServer).Wait();That could then end up deadlocking many connections, right? |
| } | ||
| } | ||
|
|
||
| private class Work |
There was a problem hiding this comment.
Why a class rather than a struct?
There was a problem hiding this comment.
Originally this was a struct like it is in LibuvThread where I copied this from. I was comparing the performance of queueing a struct vs. a class vs. an implicitly allocated closure. Whatever difference there is between the three was indistinguishable from the noise of the JSON and plaintext benchmarks so I forgot about it and left it as a class. I'll change it back to a struct though.
|
@stephentoub no user code runs on these schedulers, just our code (the code in the SocketConnection itself).
On linux it ends up being a double dispatch. @halter73 We could avoid the queue if we're already on a thread pool thread in the async SocketAwaitable. |
| { | ||
| var work = new Work | ||
| { | ||
| CallbackAdapter = (c, s) => ((Action<T>)c)((T)s), |
There was a problem hiding this comment.
If the adapter is the same for every work item, why take the space to store the delegate? I assume this will change once David's change goes through to make the scheduler state object instead of T?
There was a problem hiding this comment.
We won't need this once state goes back to object.
Technically the CallbackAdapter changes as T does. My understanding is that the Action<object, object> assigned here is static but it's static per T meaning a new Action<object, object> gets initialized the first time this method gets called with a given T. It should be the equivalent to what we do here.
Ah, ok. |
9a9a8b8 to
5466368
Compare
|
I have a theory why dedicating threads to I/O and thread hoping for the request processing improves performance. My theory is that it is a data-caching effect. Basically by having the I/O threads do pretty much nothing but I/O, they keep the code needed for that in their L1 cache. Thus I/O threads can run faster (because their L1 cache is 'precharged' with the data they need). Yes, you have the costs of handoff, but those happen only twice per request, and the actual data are at least in L2 so it is easy to come out ahead by specializing. I want to think about whether we can test that this is what is happening, and whether we can leverage it more. |
|
By the way both the libuv and the sockets (before or after) show that we only get to 85% CPU utilization.
That is not where we want to be. We want to be at 100% (unless we know that we are network constrained?) I did not think you could do that with the JSON benchmark. More things to look at. |
|
@vancem I agree that the perf improvement probably has something to do with the reduced context switching. I also think giving subsets of connections their own IO queues reduces contention. |
|
I am not actually suggesting fewer context switches are the reason for the improved performance. In fact I would expect MORE context switching in the new approach. I am suggesting that it is because by ONLY doing the OS part of the read (which is a big chunk of the total processing), you allow those I/O threads to SPECIALIZE in doing just that (their L1 caches have all the OS code and data needed to do that), and thus run faster (since memory latency is often the limiting factor for CPU performance). You can go further (as you have) and break the pipeline into more 'chunks' where each CPU can 'specialize in doing one chunk. however the problem is that the task scheduler is not 'smart' about scheduling these chunks, which will 'ruin' the cache that was built up (by the way you would also expect results to vary depending on the size of the machine caches (which doe vary from machine to machine)). I am thinking about ways of both validating and maximizing the good effect here. Basically make the threadpool 'smart' about allowing threads to specialize in doing one particular thing. |
|
Was speculating on this; obv from a blackbox level... best I could come up with was With immediate dispatching there's mostly likely a thread waiting for the IO so the data can be immediately handed over to user-space. When the IO thread handles some of the request processing there isn't a thread waiting so the kernel needs to buffer the data then do an additional kernel->user mode copy when the thread comes back. There would be an additional cost from thread switching and potentially cold caches from the additional dispatch; however the direct to user mode data might more than make up for this? Meltdown and Spectre probably haven't helped. Just an idea... |
|
It is easy enough to validate some of these things with ETW. We will know context switches and we can get DCache miss rates, and test in scnearios where more or less 'specialization' by thread/processor is possible. I am playing with this now... |
Inspired by #2366. Looks like it might be a small improvement. We'll need to test more scenarios.
Json Windows
Libuv
Sockets
Before
#2366
After