
Reduce Task and async method overheads#9471

Merged
stephentoub merged 2 commits into dotnet:master from stephentoub:task_perf
Feb 12, 2017

Conversation

@stephentoub
Member

This PR attempts to reduce the overhead associated with executing Tasks and invoking async methods.

Microbenchmarks look pretty good (a rough sketch of the first scenario follows the list):

  • Repeatedly creating an AsyncTaskMethodBuilder<bool> and completing it "synchronously" (before accessing its Task) with a known cached value: 1.95x
  • Repeatedly creating an AsyncTaskMethodBuilder<int> and completing it "synchronously" (before accessing its Task) with a known non-cached value: 1.53x
  • Repeatedly creating an AsyncTaskMethodBuilder<int> and completing it "asynchronously" (after accessing its Task): 1.38x
  • Repeatedly invoking an empty async Task<bool> method: 1.26x
  • Repeatedly creating and completing a TaskCompletionSource(): 1.37x
  • Repeatedly creating and completing a TaskCompletionSource(object): 1.29x
  • Having ProcessorCount worker threads continually queueing a task to its local queue, popping/executing it: 1.06x
  • Having ProcessorCount worker threads continually queueing a task to the global queue, dequeueing/executing one: Fluctuated within noise +/- 1-2%
  • Long chain of synchronous continuations: Fluctuated within noise +/- 1-2%
  • Long chain of asynchronous continuations: Fluctuated within noise +/- 1-2%
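
For reference, a rough sketch of what the first microbenchmark above might look like (a hypothetical harness, not the actual benchmark code used here):

using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

class Bench
{
    static void Main()
    {
        const int Iters = 100_000_000;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iters; i++)
        {
            // Complete "synchronously": call SetResult before touching Task,
            // so the builder can hand back the cached Task<bool> for true.
            var builder = AsyncTaskMethodBuilder<bool>.Create();
            builder.SetResult(true);
            var task = builder.Task;
        }
        sw.Stop();
        Console.WriteLine($"{Iters / sw.Elapsed.TotalSeconds:N0} ops/sec");
    }
}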

(I'd like to hold off on merging though until doing some more verification on ASP.NET workloads to make sure they're at least not negatively affected.)

cc: @jkotas, @kouvel, @benaadams

@stephentoub
Member Author

@benaadams, would you mind giving this a go with your various test suites (microbenchmarks and ASP.NET) to see how well this does for you?

@benaadams
Member

Good timing; have a test-bed already set up and running 😄

@stephentoub
Member Author

Good timing; have a test-bed already set up and running

😄

@benaadams
Member

Don't know how much it's used, but line 2330 in TaskFactory.cs could change from

if (Interlocked.CompareExchange(ref m_firstTaskAlreadyCompleted, 1, 0) == 0)

to?

if (Interlocked.Exchange(ref m_firstTaskAlreadyCompleted, 1) == 0)
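
For what it's worth, a minimal sketch of that guard pattern (WhenAnyPromise and TrySignalFirstCompletion are hypothetical names; only the m_firstTaskAlreadyCompleted field is from the source):

using System.Threading;

class WhenAnyPromise // hypothetical container for the field in question
{
    private int m_firstTaskAlreadyCompleted; // 0 = not yet, 1 = completed

    // True only for the first caller. Exchange suffices because the field
    // only ever goes 0 -> 1; we need the old value, not a full compare.
    private bool TrySignalFirstCompletion() =>
        Interlocked.Exchange(ref m_firstTaskAlreadyCompleted, 1) == 0;
}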

@benaadams
Member

benaadams commented Feb 10, 2017

Definitely hasn't regressed; peaking up to 3.3GiB/s (2 x D15 v2 - 20 Cores, 2 Numa, Win <-> Linux)

Requests/sec: 2,478,571.74
Requests/sec: 2,561,367.13

I get the odd run that locks at ~1.9GiB/s but was getting that before this change. It's probably an entire server rack complaining about their noisy neighbour 😄

Will pull some traces

@davidfowl
Member

/cc @mikeharder

@benaadams
Member

benaadams commented Feb 10, 2017

Plaintext Traces https://aoa.blob.core.windows.net/aspnet/HttpTest-Task.zip (97MB)
All dlls should also be there

Sampling trace in HttpTest-TaskSampling.dtp (209,135,165 requests)
Call count trace in HttpTest-TaskTracing.dtp (1,458,706 requests)
Code in HttpTest.zip - though likely none of the project settings agree with the libs I'm using 😄
Exe is HttpTest.exe, which listens on port 80

Will try the task and threadpool tests in the morning.

@benaadams
Member

We'll likely get a merge clash in ThrowHelper #9473

@benaadams
Member

AsyncMethodBuilder.cs line 888

Debugger.NotifyOfCrossThreadDependency(); can't be inlined because Debugger:NotifyOfCrossThreadDependencySlow is marked as NoInlining, which in mscorlib means it has a StackCrawlMark, which in turn means its caller also won't be inlined. This inlining bail-out is then baked into the NGen image.

Have been handling this with an extra call to push the NoInlining method one more call away (e.g. in ThrowHelper).

As there are already two calls, it might be worth removing the IsAttached test from NotifyOfCrossThreadDependency, adding a comment about NoInlining being weird, and then moving the Debugger.IsAttached test into AsyncMethodBuilderCore.GetCompletionAction.

e.g.

if (Debugger.IsAttached)
    Debugger.NotifyOfCrossThreadDependency();
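
For illustration, a sketch of the buffer-call workaround described above (hypothetical names; the point is that the hot caller only pays a branch, while the non-inlinable chain sits one extra call away):

using System.Diagnostics;
using System.Runtime.CompilerServices;

static class DebuggerNotify // hypothetical helper
{
    // Small and inlineable: the hot path pays only the IsAttached branch.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void NotifyIfAttached()
    {
        if (Debugger.IsAttached)
            NotifySlow();
    }

    // Buffer call: keeps the NoInlining/StackCrawlMark chain one call away,
    // so it can't poison inlining of NotifyIfAttached's callers.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private static void NotifySlow() => Debugger.NotifyOfCrossThreadDependency();
}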

@benaadams
Member

benaadams commented Feb 10, 2017

Hmm... it's public, so might have to leave the inner Debugger.IsAttached test in and just add the if to GetCompletionAction; or add a second buffer function between the two functions in Debugger; don't think it will hugely affect debugging speed 😛

@benaadams
Member

Raised PR for Debugger #9490

- Slim down AsyncTaskMethodBuilder`1.Task and make it aggressively inlined. The common case we care to optimize for is synchronous completion, in which case the first access to Task will already find m_task non-null, and we want that access inlined.
- Slim down AsyncTaskMethodBuilder`1.SetResult. As with Task, we care most about the synchronous completion path, as when perf matters, most async method invocations complete synchronously. The code path is streamlined so that the synchronous completion path can mostly be inlined.
- Replace some throws with ThrowHelper (a sketch of this pattern follows the list)
- Mark GetTaskForResult as aggressive inlining. It contains a lot of C# code, but for a given TResult, the JIT trims away the majority of the implementation.
- Remove some unnecessary intermediate functions, e.g. Execute(), manually inlining it into its one caller
- Make the common completion path (no exceptions, no cancellation, etc.) more inlineable, and avoid calling some functions (like AddExceptionsFromChildren) when we know they will be nops.
- Make FinishContinuations inlineable. When there aren't any continuations, this shaves off a measurable percentage of time. When there are, we're no worse off, as the FinishContinuations entrypoint gets inlined, so we still only have the one function call to RunContinuations.
- Make TaskCompletionSource.TrySetResult more inlineable. It was just on the cusp, with an extra branch / call to IsCompleted putting it over the edge. But the common path for calling TrySetResult is when the call will successfully transition, in which case we don't need the IsCompleted call; it's only necessary if/when we lose the race condition, in which case we can pay a bit more to call SpinUntilCompleted.
- Avoid some duplicate logging-related calls
- Remove AggressiveInlining from an ETW-related method that did not need it; the call sites can instead just check IsEnabled before calling it.
- Remove some unnecessary writes, casts, locals, etc., make some fields readonly
- Change CompareExchange to Exchange in Task.WhenAny completion
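
The ThrowHelper pattern in the list above looks roughly like this (MyThrowHelper is a hypothetical stand-in for the internal helper; the idea is that moving the throw out of the hot method keeps that method small enough to inline):

using System;
using System.Runtime.CompilerServices;

static class MyThrowHelper // hypothetical stand-in for the internal ThrowHelper
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void ThrowArgumentNullException(string paramName) =>
        throw new ArgumentNullException(paramName);
}

static class Example
{
    public static void SetResult(object result)
    {
        // The hot method keeps only a branch and a call; the throw (and the
        // exception-construction IL) lives in the non-inlined helper.
        if (result == null)
            MyThrowHelper.ThrowArgumentNullException(nameof(result));
        // ... common path ...
    }
}
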
@stephentoub
Member Author

Thanks, @benaadams and @kouvel.

@benaadams
Member

@stephentoub will do some longer runs today with the latest merges; and the Task tests

@stephentoub
Member Author

Thanks, @benaadams.

@benaadams
Member

benaadams commented Feb 11, 2017

Running tests on updated libs https://aoa.blob.core.windows.net/aspnet/HttpTest-Task2a.zip (106MB)
HttpTest.exe plaintext server (port 80)
ThreadPoolTaskTesting.exe task tests

Will post traces when finished; using the .ni from release

@benaadams
Member

benaadams commented Feb 11, 2017

@stephentoub sampling traces

ThreadPoolTaskTesting2.dtp https://aoa.blob.core.windows.net/aspnet/ThreadPoolTaskTesting2.zip (34MB)

HttpTest-Task2a.dtp https://aoa.blob.core.windows.net/aspnet/HttpTest-Task2b.zip (1.2MB)

  40 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     5.67ms    6.98ms 488.99ms   93.15%

  7,524,482,848 requests in 60.00m, 0.90TB read
  Socket errors: connect 19, read 1876, write 0, timeout 0

@stephentoub
Member Author

Thanks for running these, Ben. What am I looking at here? Are these numbers better? worse? same?

@benaadams
Member

benaadams commented Feb 11, 2017

Long running execution samples to see what's being executed. The workloads are definitely better :)

There are some slight regressions when very many threads end up queuing batches to the global threadpool; but they aren't consistent. As discussed previously, that looks like increased contention from the essentially empty tasks going even faster, rather than a practical issue. One thread creating single threadpool/Task work items on a 20 core VM was always an outlier (1/4 the speed of the rest) and that seems to have gone. 💯

ASP.NET workload looks good.

@benaadams
Member

LGTM

@davidfowl
Member

Ideally we'd run our benchmarks as well before the ASP.NET workload is considered "good".

@benaadams it would be good to understand what kinds of tests you're running and what spectrum of things are being covered. If you have some that we're missing, we should get them into our benchmarks.

@stephentoub
Member Author

Ideally we'd run our benchmarks as well before the ASP.NET workload is considered "good".

I've been running my own that test a spectrum of things that hit these code paths, in addition to the ASP.NET plaintext and JSON workloads. Ben's been helpful in providing additional verification. What other scenarios are you concerned this will regress, David?

@stephentoub stephentoub merged commit 2b7a76c into dotnet:master Feb 12, 2017
@stephentoub stephentoub deleted the task_perf branch February 12, 2017 00:58
@davidfowl
Member

I don't have any problems with this change in particular. It'd be cool if we were all running the same set of tests. Also I'd like it if somebody from our team also ran benchmarks for changes where you think it may affect our scenarios. More eyes on it is a good thing.

PS: Ben I do trust you 😁

@stephentoub
Member Author

Also I'd like it if somebody from our team also ran benchmarks for changes where you think it may affect our scenarios

I would like that, too.

@benaadams
Member

benaadams commented Feb 12, 2017

it would be good to understand what kinds of tests you're running

Am testing 12 different ways to call and use Tasks, with awaiting, continuations, fanout, chains and direct returns. Via both the threadpool and dedicated threads. Also calling QUWI from both dedicated threads and threadpool. These are tested being called serially and in parallel (x1, x2, x16, x64, x512) and at different recursion depths (x1, x16, x64, x512).

In total 650 different tests ((12+1) x 2 x 5 x 5); which are then run in batches of 262,144 iterations (then pause for full GC)
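
As a flavour of the kind of test (a hypothetical illustration, not the actual ThreadPoolTaskTesting code), an awaited chain at a given recursion depth with a direct return at the bottom might look like:

using System.Threading.Tasks;

static class ChainTest // illustrative only
{
    public static async Task<int> AwaitChain(int depth)
    {
        if (depth == 0)
            return 0;                           // direct return: completes synchronously
        return await AwaitChain(depth - 1) + 1; // one await per level of the chain
    }
}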

The ThreadPoolTaskTesting2.dtp sampling profile above was run on a 20 core 2 Numa node D15 v2, and is 132MB unzipped for 17,039,360,000 total test cycles taking about 6 hrs.

The HttpTest-Task2a.dtp sampling profile above is from a simple plaintext run between two of the same VMs for an hour, for 7,524,482,848 requests (approx 1 TByte of plaintext 🙀 responses); throughput did decline slowly over the profiling.

Though the traces are mostly for identifying the hot methods rather than the actual performance.

I also run the Task tests without profiling on 4 core, 8 (16 HT) core, and 20 core VMs. The plaintext test on the VM is the one bursting up to 3.3 GiB/s for requests/sec: 2,561,367.13; though I am testing over shared networking between two VMs in two different Azure subscriptions and it does vary between runs.

The ASP.NET test is particularly important because, as the code paths for the global threadpool have become more efficient, contention on it has increased for the zero-workload tests, with the contention becoming stronger as the core count increases and across Numa nodes.

Adding a workload to the threadpool items decreases this contention and is more real world; with ASP.NET being a very relevant workload to use for this.

Plaintext is the shortest task workload to check the contention "relief" on; the others would mostly be in their own code and decrease the contention much more significantly.

The simple plaintext server code was included in an earlier bundle in the thread, though is also at https://aoa.blob.core.windows.net/aspnet/HttpTest-Code.zip
The Task tests are on github https://github.com/benaadams/ThreadPoolTaskTesting

The binaries for coreclr and corefx from current source, aspnet from myget, and the two executables are in an earlier comment #9471 (comment) (these also help with correlating the traces to source).

@benaadams
Member

Also I'd like it if somebody from our team also ran benchmarks for changes where you think it may affect our scenarios

I would like that, too.

I'd like that too 😄

@mikeharder

I will run Plaintext before and after this change. Are we expecting any change in throughput?

@stephentoub
Member Author

Are we expecting any change in throughput?

I wouldn't expect any significant impact; plaintext doesn't use tasks much, and the impact here is mostly on the overhead associated with task execution. It's possible that the combination of this PR, #9562, and #9560 might move the needle slightly, as together they'll end up reducing the overhead associated with async methods, but I wouldn't expect anything huge.

@mikeharder

Before: 2.0.0-beta-001559-00 (2/10/17 21:38 UTC)
Commit: 2/12/17 00:58 UTC
After: 2.0.0-beta-001577-00 (2/13/17 19:09 UTC)

@benaadams
Member

benaadams commented Feb 14, 2017 via email

@mikeharder

1577 is currently the latest build at https://dotnet.myget.org/feed/dotnet-core/package/nuget/Microsoft.NETCore.App

@mikeharder

No significant difference between 1559 and 1577 for Plaintext.

@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Reduce Task and async method overheads

Commit migrated from dotnet/coreclr@2b7a76c
