Skip to content
This repository was archived by the owner on Dec 18, 2018. It is now read-only.

Use LibuvThread inspired schedulers in socket transport#2368

Merged
halter73 merged 7 commits into
devfrom
halter73/socket-coalesce
Mar 9, 2018
Merged

Use LibuvThread inspired schedulers in socket transport#2368
halter73 merged 7 commits into
devfrom
halter73/socket-coalesce

Conversation

@halter73
Copy link
Copy Markdown
Member

@halter73 halter73 commented Mar 7, 2018

Inspired by #2366. Looks like it might be a small improvement. We'll need to test more scenarios.

Json Windows

Libuv

[03:23:17.655] RequestsPerSecond:           452637
[03:23:17.656] Latency on load (ms):        0.7
[03:23:17.656] Max CPU (%):                 85
[03:23:17.656] WorkingSet (MB):             386
[03:23:17.656] Startup Main (ms):           419
[03:23:17.656] First Request (ms):          232.8
[03:23:17.657] Latency (ms):                0.8
[03:23:17.657] Socket Errors:               0
[03:23:17.657] Bad Responses:               0
[03:23:17.657] SDK:                         2.2.0-preview1-007522
[03:23:17.657] Runtime:                     2.1.0-preview2-26225-03
[03:23:17.657] ASP.NET Core:                2.1.0-preview2-30256

Sockets

Before

[03:21:06.064] RequestsPerSecond:           287068
[03:21:06.065] Latency on load (ms):        3
[03:21:06.065] Max CPU (%):                 84
[03:21:06.065] WorkingSet (MB):             360
[03:21:06.065] Startup Main (ms):           382
[03:21:06.065] First Request (ms):          228.6
[03:21:06.066] Latency (ms):                0.7
[03:21:06.066] Socket Errors:               0
[03:21:06.066] Bad Responses:               0
[03:21:06.066] SDK:                         2.2.0-preview1-007522
[03:21:06.066] Runtime:                     2.1.0-preview2-26225-03
[03:21:06.066] ASP.NET Core:                2.1.0-preview2-30255

#2366

[03:18:42.707] RequestsPerSecond:           342340
[03:18:42.707] Latency on load (ms):        1.4
[03:18:42.708] Max CPU (%):                 85
[03:18:42.708] WorkingSet (MB):             363
[03:18:42.708] Startup Main (ms):           376
[03:18:42.708] First Request (ms):          228.9
[03:18:42.708] Latency (ms):                0.8
[03:18:42.708] Socket Errors:               0
[03:18:42.708] Bad Responses:               0
[03:18:42.708] SDK:                         2.2.0-preview1-007522
[03:18:42.708] Runtime:                     2.1.0-preview2-26225-03
[03:18:42.708] ASP.NET Core:                2.1.0-preview2-30255

After

[03:19:41.107] RequestsPerSecond:           376305
[03:19:41.107] Latency on load (ms):        0.8
[03:19:41.107] Max CPU (%):                 82
[03:19:41.107] WorkingSet (MB):             368
[03:19:41.108] Startup Main (ms):           389
[03:19:41.108] First Request (ms):          230.1
[03:19:41.108] Latency (ms):                0.9
[03:19:41.108] Socket Errors:               0
[03:19:41.108] Bad Responses:               0
[03:19:41.108] SDK:                         2.2.0-preview1-007522
[03:19:41.108] Runtime:                     2.1.0-preview2-26225-03
[03:19:41.108] ASP.NET Core:                2.1.0-preview2-30255

@halter73 halter73 requested a review from davidfowl March 7, 2018 11:26
{
internal sealed class SocketTransport : ITransport
{
private const int _numSchedulers = 8;
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Determine the best way to calculate the optimal number of schedulers.

@davidfowl
Copy link
Copy Markdown
Member

Nice! The thing I don't like about it is the thing you commented on. I don't want to be in the business of tuning the thread count with the sockets transport. That's one of the things we wanted to get away from generally.

Can you run plain text as well to see what you get (both oses)

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 7, 2018

@davidfowl The nice thing is that these schedulers aren't in the business of managing any heavy resources like Threads. The CoalescingScheduler is basically just two queues of Actions a lock. Otherwise it's the normal ThreadPool doing all the work.

My still unproven theory is that just using the same calculation we use for the libuv thread count will be reasonable for all of the scenarios I left to test. If we end up merging this PR, we will probably want to make the scheduler count configurable like we do for the libuv transport.

@benaadams
Copy link
Copy Markdown
Contributor

benaadams commented Mar 7, 2018

So you are queuing batches to the ThreadPool rather than individual items; this might reduce the number of threadpool threads, thus reducing contention on the global queue, and local queues?

Locking is much higher for the queuing to the batch queues, than it is queuing to ThreadPool's global ConcurrentQueue (as each schedule takes a full lock).

In your output can you also dump thread counts?

/cc @stephentoub

@stephentoub
Copy link
Copy Markdown
Contributor

stephentoub commented Mar 7, 2018

I don't want to be in the business of tuning the thread count with the sockets transport.

The nice thing is that these schedulers aren't in the business of managing any heavy resources like Threads. The CoalescingScheduler is basically just two queues of Actions a lock. Otherwise it's the normal ThreadPool doing all the work.

But you are still responsible for tuning thread counts. The way this is structured, you'll only ever have at most one thread processing work from one of these coalescing schedulers, so you'll need to decide exactly how many of these you want and how you want to divvy work up between them. Isn't that what David was asking about?

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 7, 2018

@stephentoub I completely agree with your assessment. That's why I put the TODO in the code to dynamically calculate the count and suggested giving a way for the user to configure the number of schedulers (i.e. DoWork loops) like we do with libuv.

I didn't mean to imply the number of schedulers doesn't matter. Even quick experimentation changing _numSchedulers values shows what a big impact it has.

{
queue.Dequeue()();
}
}
Copy link
Copy Markdown
Contributor

@stephentoub stephentoub Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on why this logic with multiple queues is being used. What are you trying to achieve with it? Why not something like:

internal sealed class SingleWorkerScheduler
{
    private static readonly WaitCallback s_worker = s => ((SingleWorkerScheduler)s).ProcessItems();
    private readonly ConcurrentQueue<Action> _actions = new ConcurrentQueue<Action>();
    private bool _hasWorker;

    public void Schedule(Action action)
    {
        _actions.Enqueue(action);
        lock (_actions)
        {
            if (!_hasWorker)
            {
                ThreadPool.QueueUserWorkItem(s_worker, this);
                _hasWorker = true;
            }
        }
    }

    private void ProcessItems()
    {
        try
        {
            while (_actions.TryDequeue(out Action item))
            {
                item();
            }
        }
        finally
        {
            lock (_actions)
            {
                if (!_actions.IsEmpty)
                {
                    ThreadPool.QueueUserWorkItem(s_worker, this);
                    _hasWorker = true;
                }
                else
                {
                    _hasWorker = false;
                }
            }
        }
    }
}

@stephentoub
Copy link
Copy Markdown
Contributor

I completely agree with your assessment. That's why I put the TODO in the code to dynamically calculate the count and suggested giving a way for the user to configure the number of schedulers (i.e. DoWork loops) like we do with libuv. I didn't mean to imply the number of schedulers doesn't matter. Even quick experimentation changing _numSchedulers values shows what a big impact it has

Dynamically calculating and spinning up and down workers is a huge amount of work. That's the main thing the thread pool does. And it's way too easy to tune something like this to a particular benchmark and then find that it falls down quickly on others.

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 7, 2018

I shouldn't have used the word dynamic. I meant that we should consider basing the scheduler count on the number of virtual processors like we do for libuv instead of keeping _numSchedulers as a const with a value of 8. I didn't mean we should look into dynamically changing the number of schedulers in an already-running server instance.

I'll measure the performance of your SingleWorkerScheduler @stephentoub. The only reason for using two non-concurrent queues is that we've already successfully used that pattern in the libuv transport.

@stephentoub
Copy link
Copy Markdown
Contributor

he only reason for using two non-concurrent queues is that we've already successfully used that pattern in the libuv transport.

Ok. Which has better throughput will depend on the use case. Worst case for the two queues one is that you end up in a pattern where every work item dequeue causes you to take the lock and switch queues (and contend with enqueuers), and is thus worse than the concurrent queue. Best case for the two queues one is you end up in a real batch/burst scenario where you're able to fill a queue and swap it, and then there's no per work item synchronization when dequeueing each, just the one for the whole batch.

@stephentoub
Copy link
Copy Markdown
Contributor

I meant that we should consider basing the scheduler count on the number of virtual processors like we do for libuv instead of keeping _numSchedulers as a const with a value of 8.

Ok. I continue to worry that we're tuning all of this to artificial benchmarks, and that real workloads will suffer as a result. I could be off base, or maybe these workloads really are effectively stand-ins for what people will be doing.

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from 60c1a0c to f7b7823 Compare March 7, 2018 22:48
@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 7, 2018

The ConcurrentQueue seems to perform similarly in the JSON benchmark on Windows while simplifying the code, so I updated the PR to use a ConcurrentQueue.

Json Windows

Sockets

Swapping Queues

[02:50:24.579] RequestsPerSecond:           376675
[02:50:24.579] Latency on load (ms):        0.9
[02:50:24.579] Max CPU (%):                 82
[02:50:24.579] WorkingSet (MB):             377
[02:50:24.579] Startup Main (ms):           380
[02:50:24.579] First Request (ms):          233.6
[02:50:24.580] Latency (ms):                0.7
[02:50:24.580] Socket Errors:               0
[02:50:24.580] Bad Responses:               0
[02:50:24.580] SDK:                         2.2.0-preview1-007522
[02:50:24.580] Runtime:                     2.1.0-preview2-26225-03
[02:50:24.580] ASP.NET Core:                2.1.0-preview2-30261

ConcurrentQueue

[02:49:30.009] RequestsPerSecond:           379339
[02:49:30.009] Latency on load (ms):        0.8
[02:49:30.009] Max CPU (%):                 82
[02:49:30.009] WorkingSet (MB):             372
[02:49:30.009] Startup Main (ms):           387
[02:49:30.009] First Request (ms):          232.8
[02:49:30.010] Latency (ms):                0.9
[02:49:30.010] Socket Errors:               0
[02:49:30.010] Bad Responses:               0
[02:49:30.010] SDK:                         2.2.0-preview1-007522
[02:49:30.010] Runtime:                     2.1.0-preview2-26225-03
[02:49:30.010] ASP.NET Core:                2.1.0-preview2-30261

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from f7b7823 to 45d0e42 Compare March 7, 2018 23:00
@benaadams
Copy link
Copy Markdown
Contributor

You can drop the lock with ConcurrentQueue?

public class CoalescingScheduler
{
    private static readonly WaitCallback _doWorkCallback = s => ((CoalescingScheduler)s).DoWork();

    private readonly ConcurrentQueue<Action> _actions = new ConcurrentQueue<Action>();

    private int _doingWork;

    public void Schedule(Action action)
    {
        _actions.Enqueue(action);

        if (Interlocked.Exchange(ref _doingWork, 1) == 0)
        {
            ThreadPool.QueueUserWorkItem(_doWorkCallback, this);
        }
    }

    private void DoWork()
    {
        ProcessQueue();
        // Swtich off flag
        Volatile.Write(ref _doingWork, 0);
        // Process any items added in race
        ProcessQueue();
    }

    private void ProcessQueue()
    {
        while (_actions.TryDequeue(out Action item))
        {
            item();
        }
    }
}

@benaadams
Copy link
Copy Markdown
Contributor

benaadams commented Mar 7, 2018

Probably is a better way to resolve the unset race

@stephentoub
Copy link
Copy Markdown
Contributor

You can drop the lock with ConcurrentQueue?

With that scheme, only if you're ok with there being multiple workers running for some period of time.

@benaadams
Copy link
Copy Markdown
Contributor

With that scheme, only if you're ok with there being multiple workers running for some period of time.

Was exercise for reader... 😉

This better?

public class CoalescingScheduler
{
    private static readonly WaitCallback _doWorkCallback = s => ((CoalescingScheduler)s).DoWork();

    private readonly ConcurrentQueue<Action> _actions = new ConcurrentQueue<Action>();

    private int _doingWork;

    public void Schedule(Action action)
    {
        _actions.Enqueue(action);

        if (Interlocked.Exchange(ref _doingWork, 1) == 0)
        {
            ThreadPool.QueueUserWorkItem(_doWorkCallback, this);
        }
    }

    private void DoWork()
    {
        while (_actions.TryDequeue(out Action item))
        {
            item();
        }
        // Swtich off flag
        Volatile.Write(ref _doingWork, 0);

        // Process any items added in race
        while (Volatile.Read(ref _doingWork) == 0 && _actions.TryDequeue(out Action item))
        {
            item();
        }
    }
}

@stephentoub
Copy link
Copy Markdown
Contributor

This better?

Similar issue. DoWork could have just checked _doingWork and found it to be 0, then lots of enqueuers come along and queue a bunch of items, and the first will end up queueing another consumer. Now the original consumer will be dequeueing and executing at least one work item concurrently with another consumer.

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 8, 2018

I like the idea Ben. I don't see there being a huge issue with having work items for a given scheduler running in parallel as long as it's rare. Unfortunately it doesn't seem to perform better than locking, and locking is a bit simpler conceptually.

Json Windows

Sockets

lock free

[04:21:06.395] RequestsPerSecond:           373138
[04:21:06.395] Latency on load (ms):        0.9
[04:21:06.396] Max CPU (%):                 84
[04:21:06.396] WorkingSet (MB):             370
[04:21:06.396] Startup Main (ms):           395
[04:21:06.396] First Request (ms):          233.2
[04:21:06.396] Latency (ms):                0.8
[04:21:06.397] Socket Errors:               0
[04:21:06.397] Bad Responses:               0
[04:21:06.397] SDK:                         2.2.0-preview1-007522
[04:21:06.397] Runtime:                     2.1.0-preview2-26225-03
[04:21:06.397] ASP.NET Core:                2.1.0-preview2-30263

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from 45d0e42 to e2a16c0 Compare March 8, 2018 00:54
@davidfowl
Copy link
Copy Markdown
Member

@halter73 get some plaintext numbers.

{
if (!_actions.IsEmpty)
{
ThreadPool.QueueUserWorkItem(_doWorkCallback, this);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this to avoid blocking the thread pool thread for too long?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the theory behind it, but I didn't test the performance of just looping again. I suspect that would perform similarly.


private void DoWork()
{
while (_actions.TryDequeue(out Action item))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a max number of items to dequeue so we don't starve this thread?

@davidfowl
Copy link
Copy Markdown
Member

@stephentoub

Ok. I continue to worry that we're tuning all of this to artificial benchmarks, and that real workloads will suffer as a result. I could be off base, or maybe these workloads really are effectively stand-ins for what people will be doing.

Agreed. If we end up needing to tweak this number for different workloads then I think we've failed. We need to run all of our benchmarks to make sure we're not missing anything glaring.

@halter73 If reads do complete synchronously, it's possible for a connection to delay/starve the other work items in a queue. Before that was happening with the IO thread itself, but now we've moved that into our queue.

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 8, 2018

@halter73 If reads do complete synchronously, it's possible for a connection to delay/starve the other work items in a queue. Before that was happening with the IO thread itself, but now we've moved that into our queue.

Yes. But now it wouldn't be a single connection starving others, but a single scheduler which is responsible for multiple connections. And unlike connections, there's a fixed number of schedulers which will likely be kept under Environment.ProcessorCount.

Still, if this is a concern, having a max iteration count like I did in an earlier version of the PR doesn't seem to effect performance much.

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from e2a16c0 to 78dc643 Compare March 9, 2018 01:30
@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 9, 2018

Pipelined plaintext performance on Windows is similar to #2366.

[05:34:44.537] RequestsPerSecond:           1795988
[05:34:44.537] Latency on load (ms):        2.3
[05:34:44.537] Max CPU (%):                 91
[05:34:44.537] WorkingSet (MB):             373
[05:34:44.538] Startup Main (ms):           377
[05:34:44.538] First Request (ms):          141.2
[05:34:44.538] Latency (ms):                0.6
[05:34:44.538] Socket Errors:               0
[05:34:44.538] Bad Responses:               0
[05:34:44.538] SDK:                         2.2.0-preview1-007522
[05:34:44.538] Runtime:                     2.1.0-preview2-26225-03
[05:34:44.538] ASP.NET Core:                2.1.0-preview2-30272

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from 78dc643 to 1d8ec43 Compare March 9, 2018 08:20
@davidfowl
Copy link
Copy Markdown
Member

@halter73 can you get updated numbers? We got a new corefx build.

/// Defaults to half of <see cref="Environment.ProcessorCount" /> rounded down and clamped between 1 and 16.
/// </remarks>
//public int IOLoopCountCount { get; set; } = ProcessorThreadCount;
public int IOLoopCountCount { get; set; } = Math.Min(Environment.ProcessorCount, 16);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QueueCount?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to run queueless?

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 9, 2018

Json Windows Sockets

[02:15:16.387] RequestsPerSecond:           371892
[02:15:16.387] Latency on load (ms):        1
[02:15:16.387] Max CPU (%):                 82
[02:15:16.387] WorkingSet (MB):             371
[02:15:16.387] Startup Main (ms):           379
[02:15:16.388] First Request (ms):          235.2
[02:15:16.388] Latency (ms):                0.7
[02:15:16.388] Socket Errors:               0
[02:15:16.388] Bad Responses:               0
[02:15:16.388] SDK:                         2.2.0-preview1-007522
[02:15:16.388] Runtime:                     2.1.0-preview2-26308-01
[02:15:16.388] ASP.NET Core:                2.1.0-preview2-30277

Plaintext Windows Sockets

[02:34:14.925] RequestsPerSecond:           1831133
[02:34:14.925] Latency on load (ms):        2.4
[02:34:14.926] Max CPU (%):                 91
[02:34:14.926] WorkingSet (MB):             390
[02:34:14.926] Startup Main (ms):           380
[02:34:14.926] First Request (ms):          148.7
[02:34:14.926] Latency (ms):                0.6
[02:34:14.926] Socket Errors:               0
[02:34:14.926] Bad Responses:               0
[02:34:14.926] SDK:                         2.2.0-preview1-007522
[02:34:14.926] Runtime:                     2.1.0-preview2-26308-01
[02:34:14.926] ASP.NET Core:                2.1.0-preview2-30277

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from cfb0250 to dc6fddd Compare March 9, 2018 10:39
@tmds
Copy link
Copy Markdown
Contributor

tmds commented Mar 9, 2018

Do you have some thoughts what gets coalesced here?

This could be work items which are created from the epoll thread successively. Currently, these work items do the i/o operations (e.g. read/write from the socket). This PR may cause those operations to be serialized. It would be interesting to see a benchmark when the epoll thread performed the i/o operation and then created a work item. Then read/write operations would be serialized already on the epoll thread.

@stephentoub
Copy link
Copy Markdown
Contributor

stephentoub commented Mar 9, 2018

Yes. But now it wouldn't be a single connection starving others, but a single scheduler which is responsible for multiple connections. And unlike connections, there's a fixed number of schedulers which will likely be kept under Environment.ProcessorCount. Still, if this is a concern, having a max iteration count like I did in an earlier version of the PR doesn't seem to effect performance much.

Work items queued to these schedulers run user code, right? If so, how does limiting max iterations help if a single work item blocks for a period of time? Seems like it doing so will end up starving every other connection assigned to that scheduler.

Even worse, and this may be stretching things, but imagine code did:

httpClient.PostAsync(uriBackToSameServer).Wait();

That could then end up deadlocking many connections, right?

}
}

private class Work
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why a class rather than a struct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally this was a struct like it is in LibuvThread where I copied this from. I was comparing the performance of queueing a struct vs. a class vs. an implicitly allocated closure. Whatever difference there is between the three was indistinguishable from the noise of the JSON and plaintext benchmarks so I forgot about it and left it as a class. I'll change it back to a struct though.

@davidfowl
Copy link
Copy Markdown
Member

@stephentoub no user code runs on these schedulers, just our code (the code in the SocketConnection itself).

  • It's used to dispatch work from the IO thread this socket is associated with (on windows)

On linux it ends up being a double dispatch. @halter73 We could avoid the queue if we're already on a thread pool thread in the async SocketAwaitable.

{
var work = new Work
{
CallbackAdapter = (c, s) => ((Action<T>)c)((T)s),
Copy link
Copy Markdown
Contributor

@stephentoub stephentoub Mar 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the adapter is the same for every work item, why take the space to store the delegate? I assume this will change once David's change goes through to make the scheduler state object instead of T?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't need this once state goes back to object.

Technically the CallbackAdapter changes as T does. My understanding is that the Action<object, object> assigned here is static but it's static per T meaning a new Action<object, object> gets initialized the first time this method gets called with a given T. It should be the equivalent to what we do here.

@stephentoub
Copy link
Copy Markdown
Contributor

no user code runs on these schedulers, just our code (the code in the SocketConnection itself).

Ah, ok.

Copy link
Copy Markdown
Member

@davidfowl davidfowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@halter73 halter73 force-pushed the halter73/socket-coalesce branch from 9a9a8b8 to 5466368 Compare March 9, 2018 22:56
@halter73 halter73 merged commit 71bff00 into dev Mar 9, 2018
@vancem
Copy link
Copy Markdown

vancem commented Mar 12, 2018

I have a theory why dedicating threads to I/O and thread hoping for the request processing improves performance.

My theory is that it is a data-caching effect. Basically by having the I/O threads do pretty much nothing but I/O, they keep the code needed for that in their L1 cache. Thus I/O threads can run faster (because their L1 cache is 'precharged' with the data they need). Yes, you have the costs of handoff, but those happen only twice per request, and the actual data are at least in L2 so it is easy to come out ahead by specializing.

I want to think about whether we can test that this is what is happening, and whether we can leverage it more.

@vancem
Copy link
Copy Markdown

vancem commented Mar 12, 2018

By the way both the libuv and the sockets (before or after) show that we only get to 85% CPU utilization.

  • libuv: [03:23:17.656] Max CPU (%): 85
  • after [03:19:41.107] Max CPU (%): 82

That is not where we want to be. We want to be at 100% (unless we know that we are network constrained?) I did not think you could do that with the JSON benchmark.

More things to look at.

@halter73
Copy link
Copy Markdown
Member Author

halter73 commented Mar 12, 2018

@vancem I agree that the perf improvement probably has something to do with the reduced context switching. I also think giving subsets of connections their own IO queues reduces contention.

@vancem
Copy link
Copy Markdown

vancem commented Mar 12, 2018

I am not actually suggesting fewer context switches are the reason for the improved performance. In fact I would expect MORE context switching in the new approach. I am suggesting that it is because by ONLY doing the OS part of the read (which is a big chunk of the total processing), you allow those I/O threads to SPECIALIZE in doing just that (their L1 caches have all the OS code and data needed to do that), and thus run faster (since memory latency is often the limiting factor for CPU performance).

You can go further (as you have) and break the pipeline into more 'chunks' where each CPU can 'specialize in doing one chunk. however the problem is that the task scheduler is not 'smart' about scheduling these chunks, which will 'ruin' the cache that was built up (by the way you would also expect results to vary depending on the size of the machine caches (which doe vary from machine to machine)).

I am thinking about ways of both validating and maximizing the good effect here. Basically make the threadpool 'smart' about allowing threads to specialize in doing one particular thing.

@halter73 halter73 deleted the halter73/socket-coalesce branch March 12, 2018 23:16
@benaadams
Copy link
Copy Markdown
Contributor

benaadams commented Mar 12, 2018

Was speculating on this; obv from a blackbox level... best I could come up with was

With immediate dispatching there's mostly likely a thread waiting for the IO so the data can be immediately handed over to user-space.

When the IO thread handles some of the request processing there isn't a thread waiting so the kernel needs to buffer the data then do an additional kernel->user mode copy when the thread comes back.

There would be an additional cost from thread switching and potentially cold caches from the additional dispatch; however the direct to user mode data might more than make up for this? Meltdown and Spectre probably haven't helped.

Just an idea...

@vancem
Copy link
Copy Markdown

vancem commented Mar 13, 2018

It is easy enough to validate some of these things with ETW. We will know context switches and we can get DCache miss rates, and test in scnearios where more or less 'specialization' by thread/processor is possible. I am playing with this now...

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants