This repository was archived by the owner on Dec 18, 2018. It is now read-only.

Concurrent work queues, less native async_send transitions #522

Closed

benaadams wants to merge 2 commits into aspnet:dev from benaadams:work-queues

Conversation

@benaadams
Contributor

Process queues till complete
Remove the high contention locks
Reduce the calls to native async_send when not required (e.g. loop is currently processing).

From #519

Member


I'm pretty sure there is a race here.

What if we don't call uv_async_send because the loop is apparently not idle when in fact it is merely finishing up? It's possible that _workQueue.IsEmpty returns true, but before _loopIdle is actually set to true, new work is enqueued and WakeUpLoop is called. In that case the wake-up is skipped and the loop goes back to sleep with work still queued.
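
A minimal sketch of the shape in question, with the race window marked. _workQueue, _loopIdle and WakeUpLoop are the names used above; everything else (the class, the delegate standing in for uv_async_send) is illustrative, not Kestrel's actual code:

```csharp
using System;
using System.Collections.Concurrent;

// Sketch of a flag-gated wake-up; the native uv_async_send call is stood in
// for by a plain delegate.
class LoopWakeupSketch
{
    private readonly ConcurrentQueue<Action> _workQueue = new ConcurrentQueue<Action>();
    private readonly Action _uvAsyncSend;
    private volatile bool _loopIdle;

    public LoopWakeupSketch(Action uvAsyncSend)
    {
        _uvAsyncSend = uvAsyncSend;
    }

    public void WakeUpLoop(Action work)          // producer threads
    {
        _workQueue.Enqueue(work);
        if (_loopIdle)                           // (1) may read false while the loop
        {                                        //     is only just finishing up,
            _uvAsyncSend();                      //     so no wake-up is sent
        }
    }

    private void ProcessQueue()                  // loop thread
    {
        Action work;
        while (_workQueue.TryDequeue(out work))
        {
            work();
        }
        // Race window: a producer can enqueue and evaluate (1) between the
        // failed TryDequeue above and the write below, after which the loop
        // blocks with work still queued and no pending async_send.
        _loopIdle = true;
    }
}
```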

Contributor Author


Thought about this for a while; the only performant way I can think of is to hit it with a sledgehammer. Changed how OnPost works.

@halter73
Member

halter73 commented Jan 4, 2016

There are basically two things we need to determine before merging this PR. The first is whether using a concurrent queue is faster than swapping normal queues inside a lock when consuming and always locking when adding.

I think we should separately measure the impact of tracking _loopIdle in managed code to reduce native calls to async_send.
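
For reference, a minimal sketch of the locked-queue alternative described above - lock on every add, swap the whole queue out under the lock when consuming. Names are illustrative, not Kestrel's actual fields:

```csharp
using System;
using System.Collections.Generic;

// Sketch of "always lock when adding, swap under the lock when consuming".
class LockedQueueSketch
{
    private readonly object _workSync = new object();
    private Queue<Action> _workQueue = new Queue<Action>();

    public void Post(Action work)
    {
        lock (_workSync)                     // every producer contends here
        {
            _workQueue.Enqueue(work);
        }
    }

    private void ProcessQueue()              // loop thread
    {
        Queue<Action> queue;
        lock (_workSync)                     // one brief lock to take the whole batch
        {
            queue = _workQueue;
            _workQueue = new Queue<Action>();
        }
        while (queue.Count > 0)              // drained outside the lock
        {
            Action work = queue.Dequeue();
            work();
        }
    }
}
```

The swap keeps the consumer's critical section tiny, but every Post still serializes on _workSync - which is the contention benaadams points at below.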

@benaadams
Contributor Author

The queue issue is all the writes contending on the lock rather than the swapping, which is probably quite efficient - though the writes also contend with the swap.

The async_send issue is the managed->native transition being expensive, as it does stack prep-work; e.g. for Buffer.BlockCopy just calling its native counterpart (https://github.com/dotnet/coreclr/issues/2430#issuecomment-166594959):

"There doesn't appear to be any transition cost but Buffer::BlockCopy's function prolog takes 11.3%. Not surprising considering that it stores 7 registers to stack - that's 56 bytes of stores in a function that's supposed to copy, say, 16 bytes."

Also there's validation on the SafeHandles and the return values, and the JIT not being able to optimize the native call to anything better, inline it, etc.; and that's ignoring whatever libuv actually does with the call.

vs checking a bool :)
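
For context, roughly what that native boundary looks like in a plain P/Invoke-style declaration. The uv_async_send signature is libuv's; the library name and the binding style are illustrative (Kestrel binds libuv through its own wrappers), so treat this as a sketch of the cost being discussed rather than the actual code:

```csharp
using System;
using System.Runtime.InteropServices;

static class LibuvSketch
{
    // Every Post that wakes the loop crosses this managed->native boundary:
    // a P/Invoke frame, argument marshalling, no inlining, plus the callee's
    // own prologue - versus a single volatile bool read when the loop is
    // already known to be awake.
    [DllImport("libuv", CallingConvention = CallingConvention.Cdecl)]
    public static extern int uv_async_send(IntPtr handle);
}
```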

@benaadams
Contributor Author

i.e. on the swapping: if it were a single writer and single reader, rather than many writers and one reader, then the swap on a regular Queue would be faster than a ConcurrentQueue - but then the lock would also only need to exist on the swap, not on Add.

@halter73
Member

dev:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.73ms    5.44ms 239.17ms   94.31%
    Req/Sec    36.83k     1.86k   59.48k    85.32%
  11813916 requests in 10.10s, 1.45GB read
Requests/sec: 1169764.49
Transfer/sec:    147.26MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.79ms    6.16ms 290.01ms   96.14%
    Req/Sec    36.91k     2.87k   73.88k    95.12%
  11820562 requests in 10.10s, 1.45GB read
Requests/sec: 1170379.09
Transfer/sec:    147.33MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.53ms    4.48ms 274.42ms   96.42%
    Req/Sec    36.90k     1.74k   61.62k    85.45%
  11844142 requests in 10.10s, 1.46GB read
Requests/sec: 1172712.52
Transfer/sec:    147.63MB

benaadams/work-queue-merged:

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.39ms    3.16ms  76.18ms   94.44%
    Req/Sec    37.03k     2.65k   78.47k    95.15%
  11860835 requests in 10.10s, 1.46GB read
Requests/sec: 1174346.96
Transfer/sec:    147.83MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.30ms    2.64ms 234.58ms   93.71%
    Req/Sec    37.06k     2.87k   76.10k    95.96%
  11864330 requests in 10.10s, 1.46GB read
Requests/sec: 1174694.86
Transfer/sec:    147.88MB

Running 10s test @ http://10.0.0.100:5001/plaintext
  32 threads and 256 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.70ms    6.36ms 154.95ms   95.57%
    Req/Sec    37.04k     3.55k   97.62k    96.54%
  11843404 requests in 10.10s, 1.46GB read
Requests/sec: 1172695.07
Transfer/sec:    147.62MB

The removed locks do seem ever so slightly faster, and I personally don't see any race condition in this PR. Still, when I forwarded this PR to @GrabYourPitchforks (the best writer of lock-free code I know) to verify its correctness, his first response was: is it really worth removing the locks that make this obviously correct? I'm inclined to agree. The locks make me much more warm and fuzzy inside, since they're easier for me to reason about than access to a volatile variable.

halter73 closed this Jan 12, 2016
@benaadams
Contributor Author

Was this a comparison of current dev (faster) to the branch (older)?

Regardless, this would probably not be a sensible time to merge it*

*As I imagine people will freeze on the RTM in production, so as you and @davidfowl had mentioned previously: stability and security (or fixes for bad perf, rather than making great perf better 😁)

@halter73
Member

No. I merged this change into dev and then ran the test.

I'm glad you agree. Like I said, I think this change is good, but I don't want any uncertainty.

benaadams deleted the work-queues branch May 10, 2016 02:50
