SuperCharged MemoryPoolIterator by benaadams · Pull Request #1138 · aspnet/KestrelHttpServer

benaadams · 2016-10-03T07:05:34Z

Is a branchless cache-oblivious evaluation of the set byte; rather than the current multiple ternary operator quasi tree. Around x 2.8 faster than current for Seek operations.

The De Bruijn alternative improvement that was evaluated needs a table lookup of 8 * 8 * bytes = 512 bytes in the data cache to be fast - which is 8 cache lines on x64. (It can be compressed to 24 bytes with some extra per lookup ops).

Whereas this uses a single 64bit const already in the instruction cache (which the De Bruijn also needs).

Also trims FindFirstEqualByte from 318 bytes of asm to 94 bytes of asm (before the final inline) so better for inlining which avoids the vector copy (Added inline due to issue dotnet/coreclr#7386)

Now also contains related Vectorization improvements; single core header parsing speeds (weak cpu) as follows

                   Method |          Mean |      StdDev | Scaled |          RPS |
------------------------- |-------------- |------------ |------- |------------- |
           ParsePlaintext |   792.7620 ns |  15.9475 ns |   1.00 | 1,261,412.60 |
  ParsePipelinedPlaintext |   622.7526 ns |   5.7970 ns |   0.79 | 1,605,773.99 |

Using Multithreaded test from #1222, Results from 16 Core (32 HT) Server as follows:

                   Method |          Mean |      StdDev | Scaled |           RPS |
------------------------- |-------------- |------------ |------- |-------------- |
           ParsePlaintext |   188.0574 ns |   1.5506 ns |   1.00 |  5,317,526.25 |
  ParsePipelinedPlaintext |    67.4741 ns |   1.9620 ns |   0.36 | 14,820,495.27 |

Resolves #1129
Resolves #1141
Resolves #1213

benaadams · 2016-10-03T07:25:14Z

Generates the following 91 bytes of asm (without the final inline)

Details

; Assembly listing for method MemoryPoolIterator:LocateFirstFoundByte(byref):int
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T02] (  3,   3  )   byref  ->  rcx        
;  V01 loc0         [V01,T03] (  2,   5  )  simd16  ->  mm0         ld-addr-op
;  V02 loc1         [V02,T00] (  7,  22  )     int  ->  rax        
;  V03 loc2         [V03,T01] (  4,  10  )    long  ->  rdx        
;  V04 loc3         [V04,T05] (  2,   2  )     int  ->  rdx        
;  V05 tmp0         [V05,T04] (  2,   4  )  simd16  ->  mm0        
;  V06 OutArgs      [V06    ] (  1,   1  )  lclBlk (32) [rsp+0x00]  
;  V07 rat0         [V07    ] (  1,   1  )  simd16  ->  [rsp+0x28]   do-not-enreg[XS] must-init addr-exposed
;
; Lcl frame size = 56

G_M61145_IG01:
       4883EC38             sub      rsp, 56
       33C0                 xor      rax, rax
       4889442428           mov      qword ptr [rsp+28H], rax
       4889442430           mov      qword ptr [rsp+30H], rax

G_M61145_IG02:
       0F1001               movups   xmm0, xmmword ptr [rcx]
       33C0                 xor      eax, eax

G_M61145_IG03:
       83F802               cmp      eax, 2
       733B                 jae      SHORT G_M61145_IG06
       0F11442428           movups   xmmword ptr [rsp+28H], xmm0
       488B54C428           mov      rdx, qword ptr [rsp+8*rax+28H]
       4885D2               test     rdx, rdx
       7507                 jne      SHORT G_M61145_IG04
       FFC0                 inc      eax
       83F802               cmp      eax, 2
       7CE5                 jl       SHORT G_M61145_IG03

G_M61145_IG04:
       488BCA               mov      rcx, rdx
       48F7D9               neg      rcx
       4823CA               and      rcx, rdx
       48BAE0C0A08060402000 mov      rdx, 0x20406080A0C0E0
       480FAFCA             imul     rcx, rdx
       48C1E93D             shr      rcx, 61
       8BD1                 mov      edx, ecx
       8D04C2               lea      eax, [rdx+8*rax]

G_M61145_IG05:
       4883C438             add      rsp, 56
       C3                   ret      

G_M61145_IG06:
       E8A6598D5F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3     

; Total bytes of code 91, prolog size 16 for method MemoryPoolIterator:LocateFirstFoundByte(byref):int

Previously generated the following 318 bytes of asm

Details

Successfully inlined Vector:AsVectorInt64(struct):struct (7 IL bytes) (depth 1) [below ALWAYS_INLINE size]
**************** Inline Tree
Inlines into 06000166 MemoryPoolIterator:FindFirstEqualByte(byref):int
  [0 IL=0008 TR=000271 06000167] [FAILED: too many il bytes] MemoryPoolIterator:FindFirstEqualByteSlow(byref):int
  [1 IL=0020 TR=000016 0600009D] [below ALWAYS_INLINE size] Vector:AsVectorInt64(struct):struct
  [0 IL=0176 TR=000263 06002C13] [FAILED: unprofitable inline] InvalidOperationException:.ctor():this
Budget: initialTime=606, finalTime=606, initialBudget=6060, currentBudget=6060
Budget: initialSize=4280, finalSize=4280
; Assembly listing for method MemoryPoolIterator:FindFirstEqualByte(byref):int
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T02] (  4,   3  )   byref  ->  rsi        
;  V01 loc0         [V01,T05] (  2,   4.5)  simd16  ->  mm0         ld-addr-op
;  V02 loc1         [V02,T00] (  7,  21  )     int  ->  rcx        
;  V03 loc2         [V03,T01] (  9,  11.5)    long  ->  rax        
;  V04 tmp0         [V04,T07] (  3,   1.5)     int  ->  rcx        
;  V05 tmp1         [V05,T08] (  3,   1.5)     int  ->  rcx        
;  V06 tmp2         [V06,T09] (  3,   1.5)     int  ->  rcx        
;  V07 tmp3         [V07,T03] (  9,   4.5)     int  ->  rcx        
;  V08 tmp4         [V08,T04] (  9,   4.5)     int  ->  rdx        
;  V09 tmp5         [V09,T10] (  3,   1.5)     int  ->  rcx        
;  V10 tmp6         [V10,T11] (  3,   1.5)     int  ->  rcx        
;  V11 tmp7         [V11,T12] (  3,   1.5)     int  ->  rcx        
;  V12 tmp8         [V12,T13] (  3,   1.5)     int  ->  rcx        
;  V13 tmp9         [V13,T14] (  3,   0  )     ref  ->  rsi        
;  V14 tmp10        [V14,T06] (  2,   2  )  simd16  ->  mm0        
;  V15 OutArgs      [V15    ] (  1,   1  )  lclBlk (32) [rsp+0x00]  
;  V16 rat0         [V16    ] (  1,   1  )  simd16  ->  [rsp+0x20]   do-not-enreg[XS] must-init addr-exposed
;
; Lcl frame size = 48

G_M48472_IG01:
       push     rsi
       sub      rsp, 48
       xor      rax, rax
       mov      qword ptr [rsp+20H], rax
       mov      qword ptr [rsp+28H], rax
       mov      rsi, rcx

G_M48472_IG02:
       mov      rcx, 0xD1FFAB1E
       mov      edx, 3
       call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       movzx    rcx, byte  ptr [reloc classVar[0xd1ffab1e]]
       test     ecx, ecx
       jne      SHORT G_M48472_IG04
       mov      rcx, rsi
       call     MemoryPoolIterator:FindFirstEqualByteSlow(byref):int
       nop      

G_M48472_IG03:
       add      rsp, 48
       pop      rsi
       ret      

G_M48472_IG04:
       movups   xmm0, xmmword ptr [rsi]
       xor      ecx, ecx

G_M48472_IG05:
       cmp      ecx, 2
       jae      G_M48472_IG17
       movups   xmmword ptr [rsp+20H], xmm0
       mov      rax, qword ptr [rsp+8*rcx+20H]
       test     rax, rax
       je       G_M48472_IG15
       shl      ecx, 3
       mov      edx, 0xD1FFAB1E
       and      rdx, rax
       test     rdx, rdx
       jg       SHORT G_M48472_IG09
       mov      rdx, 0xD1FFAB1E
       and      rdx, rax
       test     rdx, rdx
       jg       SHORT G_M48472_IG07
       mov      rdx, 0xD1FFAB1E
       and      rax, rdx
       test     rax, rax
       jg       SHORT G_M48472_IG06
       mov      edx, 7
       jmp      SHORT G_M48472_IG13

G_M48472_IG06:
       mov      edx, 6
       jmp      SHORT G_M48472_IG13

G_M48472_IG07:
       mov      rdx, 0xD1FFAB1E
       and      rax, rdx
       test     rax, rax
       jg       SHORT G_M48472_IG08
       mov      edx, 5
       jmp      SHORT G_M48472_IG13

G_M48472_IG08:
       mov      edx, 4
       jmp      SHORT G_M48472_IG13

G_M48472_IG09:
       mov      rdx, rax
       and      rdx, 0xFFFF
       test     rdx, rdx
       jg       SHORT G_M48472_IG11
       and      rax, 0xD1FFAB1E
       test     rax, rax
       jg       SHORT G_M48472_IG10
       mov      edx, 3
       jmp      SHORT G_M48472_IG13

G_M48472_IG10:
       mov      edx, 2
       jmp      SHORT G_M48472_IG13

G_M48472_IG11:
       and      rax, 255
       test     rax, rax
       jg       SHORT G_M48472_IG12
       mov      edx, 1
       jmp      SHORT G_M48472_IG13

G_M48472_IG12:
       xor      edx, edx

G_M48472_IG13:
       lea      eax, [rcx+rdx]

G_M48472_IG14:
       add      rsp, 48
       pop      rsi
       ret      

G_M48472_IG15:
       inc      ecx
       cmp      ecx, 2
       jl       G_M48472_IG05

G_M48472_IG16:
       mov      rcx, 0xD1FFAB1E
       call     CORINFO_HELP_NEWSFAST
       mov      rsi, rax
       mov      rcx, rsi
       call     InvalidOperationException:.ctor():this
       mov      rcx, rsi
       call     CORINFO_HELP_THROW
       int3     

G_M48472_IG17:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     

; Total bytes of code 318, prolog size 20 for method MemoryPoolIterator:FindFirstEqualByte(byref):int

benaadams · 2016-10-03T09:38:05Z

Benchmark code https://gist.github.com/benaadams/2dd3f99230757111e91915f638067a09

Host Process Environment Information:
BenchmarkDotNet.Core=v0.9.9.0
OS=Windows
Processor=?, ProcessorCount=4
Frequency=2825621 ticks, Resolution=353.9045 ns, Timer=TSC
CLR=CORE, Arch=64-bit ? [RyuJIT]
GC=Concurrent Workstation
dotnet cli version: 1.0.0-preview2-003131

Type=FirstByteBenchmark  Mode=Throughput  Platform=X64
Jit=RyuJit  Toolchain=Core  Runtime=Core
LaunchCount=3  WarmupCount=5  TargetCount=10

Method	ByteSet	Median	StdDev	Scaled
FindFirstByteNaive	1	4.1516 ns	0.1021 ns	0.78
FirstByteCurrent	1	5.3134 ns	0.0728 ns	1.00
BensMagicNumberLoop	1	1.7779 ns	0.2667 ns	0.34
DeBruijnXorLoop	1	2.0723 ns	0.1532 ns	0.40
DeBruijnClassicUnroll	1	1.4574 ns	0.0413 ns	0.27
DeBruijnXorUnroll	1	1.4700 ns	0.0508 ns	0.28
DeBruijnXorRevisedUnroll	1	1.7636 ns	0.0399 ns	0.33
BensMagicNumberRevisedUnroll	1	1.5175 ns	0.1818 ns	0.29
FindFirstByteNaive	7	12.9161 ns	2.5250 ns	3.46
FirstByteCurrent	7	3.8624 ns	0.0681 ns	1.00
BensMagicNumberLoop	7	1.7570 ns	0.0746 ns	0.46
DeBruijnXorLoop	7	2.0868 ns	0.0721 ns	0.54
DeBruijnClassicUnroll	7	1.4678 ns	0.0461 ns	0.38
DeBruijnXorUnroll	7	1.4644 ns	0.0460 ns	0.38
DeBruijnXorRevisedUnroll	7	1.7503 ns	0.1769 ns	0.46
BensMagicNumberRevisedUnroll	7	1.4517 ns	0.0567 ns	0.38
FindFirstByteNaive	8	14.3492 ns	0.2209 ns	2.02
FirstByteCurrent	8	7.0606 ns	0.0987 ns	1.00
BensMagicNumberLoop	8	2.4293 ns	0.0699 ns	0.34
DeBruijnXorLoop	8	2.6853 ns	0.1950 ns	0.39
DeBruijnClassicUnroll	8	2.3295 ns	0.0429 ns	0.33
DeBruijnXorUnroll	8	2.3574 ns	0.0432 ns	0.33
DeBruijnXorRevisedUnroll	8	1.7484 ns	0.0508 ns	0.25
BensMagicNumberRevisedUnroll	8	1.4971 ns	0.2202 ns	0.22
FindFirstByteNaive	9	15.5937 ns	0.2663 ns	2.22
FirstByteCurrent	9	7.0356 ns	0.1400 ns	1.00
BensMagicNumberLoop	9	2.4730 ns	0.1458 ns	0.35
DeBruijnXorLoop	9	2.6880 ns	2.4056 ns	0.45
DeBruijnClassicUnroll	9	2.3542 ns	1.8773 ns	0.39
DeBruijnXorUnroll	9	2.4133 ns	1.8786 ns	0.39
DeBruijnXorRevisedUnroll	9	1.7802 ns	0.1162 ns	0.26
BensMagicNumberRevisedUnroll	9	1.4618 ns	2.2693 ns	0.27
FindFirstByteNaive	15	24.5206 ns	0.3604 ns	4.37
FirstByteCurrent	15	5.6263 ns	0.1159 ns	1.00
BensMagicNumberLoop	15	2.4115 ns	0.0593 ns	0.43
DeBruijnXorLoop	15	2.7119 ns	0.0397 ns	0.49
DeBruijnClassicUnroll	15	2.3815 ns	0.0705 ns	0.42
DeBruijnXorUnroll	15	2.3599 ns	0.1406 ns	0.43
DeBruijnXorRevisedUnroll	15	1.7565 ns	0.0522 ns	0.32
BensMagicNumberRevisedUnroll	15	1.4605 ns	0.0845 ns	0.26

benaadams · 2016-10-03T09:43:14Z

While the unrolled version is faster, it is also inlined into 16 places so will be duplicated 16 times so I've gone for the loop version.

mburbea · 2016-10-03T13:34:47Z

Wow. Where did you find this constant? How does this work?

benaadams · 2016-10-03T14:03:38Z

I couldn't believe there wasn't a number in the 2^64 space that did this. However, after exhaustive internet searching I couldn't find one. So I went to sleep, woke up in the middle of the night with a headache and idea of a method to try and actually derived it.

Having derived it, I did some searching with the search engines and couldn't find any hits either in its decimal or hexadecimal from so called it Ben's Magic Number 😺

I will write a blog post on the method I used.

MaximRouiller · 2016-10-03T18:48:22Z

@benaadams Dude... you need a blog post on how your brain works. Might help cure cancer or something.

benaadams · 2016-10-03T19:12:07Z

I think I can apply this to < Vector.Length also

halter73 · 2016-10-03T19:34:01Z

Really cool. 👍 I look forward to reading the blog post.

This doesn't seem to be doing exactly what FindFirstEqualByte or the debrujin sequence was doing, but it's probably all we need. BensMagicNumberFindByte only gives correct results if the least significant bit of the first non-zero byte is set.

I'm assuming/hoping that it's guaranteed that the return value of Vector<T>.Equals(left, right) for matching bytes will have the LSB set for bytes in matching positions. IIRC, the bytes in the matching position of the return value are always 255, and 0 (obviously) in non matching matching position. The docs leave a bit to be desired by only stating that "Returns a new vector of a specified type whose elements signal whether the elements in two specified vectors of the same type are equal.".

Here are some sample inputs that show the difference. The old FindFirstEqualByte would return 7 for all these inputs. But none of the functional tests failed, so that's a great sign 😄

> BensMagicNumberFindByte(0x0100000000000000)
7
> BensMagicNumberFindByte(0x0200000000000000)
6
> BensMagicNumberFindByte(0x0400000000000000)
4
> BensMagicNumberFindByte(0x0800000000000000)
0
> BensMagicNumberFindByte(0x1000000000000000)
0
> BensMagicNumberFindByte(0x2000000000000000)
0
> BensMagicNumberFindByte(0x4000000000000000)
0
> BensMagicNumberFindByte(0x8000000000000000)
0

benaadams · 2016-10-03T19:49:41Z

@halter73 its specifically testing the least significant bit which is set for the common values of true -1 or 255 in this case - or more specifically not zero = ~0; with 0 being false.

halter73 · 2016-10-03T23:05:12Z

+        //
+        // This means when they are next used the have been pre-evaluated and the jit can 
+        // either directly embed them or use them for branch elimiation.
+        internal static bool StaticReadonlysJitted;


Why is this required to turn the static readonlies into jitted const? Is this a bug in the jitter?

The jit does the work when it first encounters the item. For readonly statics in a beforefieldinit class its when any static variable is first used which is documented.

That first function that is being jitted gets none of the benefits of the const like nature of the variables; though every function after that does. One of the features that doesn't happen is branch elimination - like on IsLittleEndian.

I don't know if its expected behaviour or a bug (might need to unwind and recomplile to resolve it). It came up before in #554 (comment)

Ahh the original issue: https://github.com/dotnet/coreclr/issues/2526. Interesting that the CLR team doesn't want to run static ctors prior to jitting. That's too bad.

benaadams · 2016-10-03T23:16:22Z

@halter73 been thinking about the input. I'm fairly confident its safe assuming true is every bit set from an inbuilt; else you couldn't binary not it to get false - you'd still have a form of true.

halter73 · 2016-10-03T23:51:52Z

@benaadams I think it's safe too.

nicodeslandes · 2016-10-04T01:16:11Z

I tried to come up with a similar magic number that would allow to obtain the index after shifting its bits, and found a different one.
Wouldn't this number just do the same and be a bit easier to understand? 0x01020304050607
We know that after doing x=(v & (-v)) we get a power of 2, which when multiplied with the magic number will shift its bits by a multiple of 8 positions:
So after being multiplied by x, 0x0001020304050607 is turned into

0x0304050607000000 when shifted by a 3-byte offset
0x0405060700000000 when shifted by a 4-byte offset
0x0506070000000000 when shifted by a 5-byte offset
and so on...

We then simply have to read the 1st byte of the resulting number to find the offset, ie:
offset = (((v & -v) * Magic) >> 56) & 0x7

benaadams · 2016-10-04T01:48:44Z

@nicodeslandes Congats! That is a lot more obvious and what is happening; and drops a shift. Already superseded; that's the power of OSS collaboration!

nicodeslandes · 2016-10-04T02:33:04Z

@benaadams Sweet! Glad I could bring a modest contribution to this great piece of code.
I do feel bad about the BENS_MAGIC_NUMBER constant going away though. Sorry about that... :-)

benaadams · 2016-10-04T02:35:29Z

@halter73 I've added the function for testing longs as if Vector. (e.g. if vector length is 32, there are sizes 8 - 24 bytes that can be searched using 1 - 3 longs)

However if I add the code to use it will have a horrific merge clash with #1146

halter73 · 2016-10-04T02:54:25Z

Funny. The 0x01020304050607 version is exactly what I explained to people in Redmond. I even wrote that constant on the whiteboard (I know @benaadams saw the tweet). I figured I might be missing something subtle, but that would make it easy for people to get the gist.

Fortunately most of the people I was talking to had a vague understanding of the debrujin code, so they already understood multiplying by (v & (-v)) was a bit shift.

The key insight from there is that the shifts are always 8 bit aligned. So you can make the mask/window up to 8 bits and not worry about any overlapping windows. 8 bits of course being more than enough to represent 0-7. This is why you don't need to use a debrujin sequence for array indexing. If there were overlapping windows, it would be a different story.

I've added the function for testing longs as if Vector...

Is that the SetLowBitForByteMatch that's not being called? Is that supposed to be used by the non hardware accelerated code path? I would put that in a separate PR if your not calling it in this one. How did you come up with it? How does it compare perf-wise to the naive code?

davidfowl · 2016-10-04T03:09:04Z

👏

benaadams · 2016-10-04T03:13:53Z

Is that the SetLowBitForByteMatch that's not being called?

Yes, removed it from this PR, will add back later with measurements.

Is that supposed to be used by the non hardware accelerated code path?

Is to be used for data lengths shorter than Vector; so on a 32 byte wide vector currently the final 31 bytes of a GET request is processed per byte. Will cover other points in the PR (post merge clashes)

halter73 · 2016-10-04T03:21:13Z

@benaadams 👍 I guess that's why there's no else after if (IsHardwareAccelerated). The non-accelerated path get's used near the end of the input either way.

jamesqo · 2016-10-04T16:41:57Z

Nice! 🎉

JosephTremoulet · 2016-10-04T17:47:43Z

+        private static readonly bool IsHardwareAccelerated = Vector.IsHardwareAccelerated;
+        private static readonly bool IsLittleEndian = BitConverter.IsLittleEndian;
        private static readonly int _vectorSpan = Vector<byte>.Count;
+        private static readonly int _vectorUlongSpan = Vector<ulong>.Count;


Were you not seeing Vector.IsHardwareAccelerated and Vector<xxx>.Count being replaced by constants during jiting when used in-line in the method? The code you have now gets them replaced by constants during importing due to an optimization for ldsfld on read-only static fields, but impSIMDIntrinsic is supposed to kick in and replace their references with constants if you use them in-line. Doing the latter should have the benefit that the Jit could recognize that this is where the constants came from so that it could e.g. use that for an unrolling heuristic and unroll the loop in Seek to use constant rather than variable indexing (dotnet/coreclr#7422).

@JosephTremoulet made change, seems to work. BitConverter.IsLittleEndian definately wasn't doing branch elimination though so have left it as an early eval readonly static when it does.

@benaadams Did you run any benchmarks showing setting StaticReadonlysJitted speeds up Seek?

Obviously you would need to "warm up" Seek prior to the benchmark in order to avoid penalizing the non-StaticReadonlysJitted code for the time spent running the static constructors.

@benaadams sounds good, thanks. Also makes me wonder if we should make BitConverter.IsLittleEndian a JIT intrinsic...

@JosephTremoulet Its actually a const in coreclr; but it calls the corefx version which calculates it - might just need a forwarder?

@halter73 without it, it includes both branches and calls the static ctor; then does a memory load from class var every execution in the main flow in the asm

mov rcx, 0xD1FFAB1E mov edx, 3 call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE movzx rcx, byte ptr [reloc classVar[0xd1ffab1e]] test ecx, ecx jne SHORT G_M48472_IG04 mov rcx, rsi call MemoryPoolIterator:FindFirstEqualByteSlow(byref):int nop

So its an extra branch and long jump with branch per op and probably a volatile read? (just guessing on the cctor test)

Is that measurable in a microbenchmark? If not, I would take the extra branch in the generated asm for cleaner code.

I'll pull it from this PR and submit a new one and grumble about it there... 😉

halter73 · 2016-11-21T22:49:30Z

+    "emitEntryPoint": true,
+    "compile": {
+      "include": [
+        "../shared/SocketInputExtensions.cs",


I just noticed you aren't using "../shared/**/*.cs" like the other test projects. Any reason for that?

Some of them reference xunit so have to pull in a bunch of extra references that aren't used for the globbing

benaadams · 2016-11-21T22:56:29Z

Rebased. What a journey..! I think its too complicated to squash else the "why" of git blame will be missing 😉

benaadams · 2016-11-21T23:45:22Z

Though the real reason is I tried rearranging some things; but caused merge conflicts. Note if you can cause merge conflicts in a single PR it has got too big...

benaadams · 2016-11-22T00:12:30Z

Could squash the clean up ones?

benaadams · 2016-11-22T02:30:09Z

Added Multithreaded test in #1222

Results from 16 Core (32 HT) Server as follows:

                   Method |          Mean |      StdDev | Scaled |           RPS |
------------------------- |-------------- |------------ |------- |-------------- |
           ParsePlaintext |   188.0574 ns |   1.5506 ns |   1.00 |  5,317,526.25 |
  ParsePipelinedPlaintext |    67.4741 ns |   1.9620 ns |   0.36 | 14,820,495.27 |

sivarv · 2016-11-22T02:43:40Z

Any idea how much of TechEmPower numbers (both plaintext and Json ) have moved due to this change?

halter73 · 2016-11-22T22:55:35Z

This comment shows the performance difference in the plaintext benchmark prior to merging the PR. I haven't yet gotten the chance to run the TechEmpower benchmarks after merging.

benaadams · 2018-11-18T07:48:01Z

The saga is complete dotnet/coreclr#21073; end of an era!

The age of CPU intrinsics in C# has begun!

dnfclas added the cla-already-signed label Oct 3, 2016

benaadams mentioned this pull request Oct 3, 2016

Use Ben's Magic Number for FindFirstEqualByte #1136

Closed

benaadams mentioned this pull request Oct 3, 2016

Consider optimizing the use of Vector in MemoryPoolIterator #1129

Closed

halter73 reviewed Oct 3, 2016

View reviewed changes

benaadams force-pushed the bens-magic-number branch from 002481f to 16ea886 Compare October 4, 2016 02:01

benaadams changed the title ~~Use Ben's magic number for FindFirstEqualByte~~ Use PowerOfTwoToHighByte for FindFirstEqualByte Oct 4, 2016

benaadams force-pushed the bens-magic-number branch from 16ea886 to 4bf0a57 Compare October 4, 2016 02:14

benaadams force-pushed the bens-magic-number branch from dfd85c0 to 4bf0a57 Compare October 4, 2016 02:57

JosephTremoulet reviewed Oct 4, 2016

View reviewed changes

benaadams mentioned this pull request Oct 5, 2016

Don't spill Vectors from registers to stack #1146

Closed

benaadams added 12 commits November 21, 2016 22:48

Add Request Parsing benchmark

90c7be1

MemoryPoolIterator byref structs

12e2f30

Don't inline slow paths

4cdcca2

Feedback + Cleanup

7ab4442

wasLastBlock adjustments

5041d6c

Move perf, share code

8ce47fb

Use ternary op rather than lazy and

2011a27

defaullt(MemoryPoolIterator) test coverage

ef5ad3d

Add GetArraySegment test

dc90dd1

Add 100% coverage Take tests

ba0b7cc

Improve Skip coverage

9ec4d88

MemoryPoolIterator feedback

2eba401

halter73 reviewed Nov 21, 2016

View reviewed changes

benaadams force-pushed the bens-magic-number branch from fee8441 to 2eba401 Compare November 21, 2016 22:52

halter73 mentioned this pull request Nov 21, 2016

An optimization to Kestrel Server Seek() method #1141

Closed

halter73 merged commit 2eba401 into aspnet:dev Nov 22, 2016

muratg modified the milestones: 1.2.0, 2.0.0 Jan 12, 2017

benaadams mentioned this pull request Feb 21, 2017

Faster Span dotnet/corefx#16222

Closed

sivarv mentioned this pull request Jan 31, 2020

Optimize byte type comparisons dotnet/runtime#6861

Open

JosephTremoulet mentioned this pull request Jan 31, 2020

Unroll small loops over Vector<T> elements dotnet/runtime#6891

Closed

Conversation

benaadams commented Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benaadams commented Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

benaadams commented Oct 3, 2016

Uh oh!

benaadams commented Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mburbea commented Oct 3, 2016

Uh oh!

benaadams commented Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaximRouiller commented Oct 3, 2016

Uh oh!

benaadams commented Oct 3, 2016

Uh oh!

halter73 commented Oct 3, 2016

Uh oh!

benaadams commented Oct 3, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benaadams commented Oct 3, 2016

Uh oh!

halter73 commented Oct 3, 2016

Uh oh!

nicodeslandes commented Oct 4, 2016

Uh oh!

benaadams commented Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nicodeslandes commented Oct 4, 2016

Uh oh!

benaadams commented Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

halter73 commented Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davidfowl commented Oct 4, 2016

Uh oh!

benaadams commented Oct 4, 2016

Uh oh!

halter73 commented Oct 4, 2016

Uh oh!

jamesqo commented Oct 4, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benaadams Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benaadams Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

benaadams commented Oct 3, 2016 •

edited

Loading

benaadams commented Oct 3, 2016 •

edited

Loading

benaadams commented Oct 3, 2016 •

edited

Loading

benaadams commented Oct 3, 2016 •

edited

Loading

benaadams commented Oct 3, 2016 •

edited

Loading

benaadams commented Oct 4, 2016 •

edited

Loading

benaadams commented Oct 4, 2016 •

edited

Loading

halter73 commented Oct 4, 2016 •

edited

Loading

benaadams Oct 4, 2016 •

edited

Loading

benaadams Oct 4, 2016 •

edited

Loading

sivarv commented Nov 22, 2016 •

edited

Loading