This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Vectorize SpanHelpers.IndexOf for byte #17143

Merged
jkotas merged 3 commits into dotnet:master from benaadams:span-indexof
Mar 21, 2017

Conversation

@benaadams
Member

@benaadams benaadams commented Mar 15, 2017

return (int)(byte*)(index + 7);
#if !netstandard10
VectorLength:
// Already checked, but for jit branch elmination
Member

What does this do for the jit branch elimination?

Member Author

It is always false when there is no vector support, so it should make the function's asm smaller?

Member

@jkotas jkotas Mar 15, 2017

The JIT has dead code elimination. When there is no vector support, the never-taken branch at the top should be enough to make this code dead. I do not think that this extra check can ever help anything.
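To illustrate the point about the single guard, here is a minimal sketch (not the PR's code). `GuardSketch` and `ChoosePath` are hypothetical names introduced for illustration; the only real APIs used are `Vector.IsHardwareAccelerated` and `Vector<byte>.Count`. The assumption is that the JIT treats `Vector.IsHardwareAccelerated` as a constant, so one check at the top is enough for it to drop the vector block when there is no vector support.

```csharp
using System;
using System.Numerics;

public static class GuardSketch
{
    // Hypothetical helper: reports which search path would be taken
    // for a buffer of the given length.
    public static string ChoosePath(int length)
    {
        // Assumption: the JIT evaluates IsHardwareAccelerated as a constant.
        // When it is false, everything inside this block is unreachable and
        // can be removed, so a second check further down adds nothing.
        if (Vector.IsHardwareAccelerated && length >= Vector<byte>.Count)
        {
            return "vector";
        }

        return "sequential";
    }
}
```

A buffer shorter than `Vector<byte>.Count` always takes the sequential path, regardless of hardware support.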

return (int)(byte*)(index + 7);
#if !netstandard10
VectorLength:
// Already checked, but for jit branch elmination
Member

Typo: elmination

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector<byte> GetVector(byte vectorByte)
{
#if !NETCOREAPP
Member

This needs to be defined in the .csproj; and it should be lowercase.

@benaadams
Member Author

Think I got the csproj changes correct?

@jkotas
Member

jkotas commented Mar 15, 2017

Think I got the csproj changes correct?

Yes.

@jkotas
Member

jkotas commented Mar 15, 2017

I have been wondering whether it would be worth it to check for a few matches before we take the vectorized path, to have a more balanced performance profile over all possible inputs. Would you mind collecting some numbers about it?

byte[] a = new byte[255];
for (byte i = 0; i < 255; i++) a[i] = i;
Span<byte> span = new Span<byte>(a);

span.IndexOf(x) <- what is the performance of vectorized vs. non-vectorized IndexOf for small x? From which x will the vectorized implementation start winning?

@jkotas
Member

jkotas commented Mar 15, 2017

LGTM otherwise.

@benaadams
Member Author

would be worth it to check for a few matches before we take the vectorized path

Am guessing it will make initial start better, but then the following worse.

E.g. if the match was at byte 9 and the startup checked the first 8 bytes and then went vector, the byte 9 match would pay the cost of both the 8-byte individual search and the vector startup.

Also for something like "StartsWith" you check the first byte and subsequent rather than doing an index of?

Will try though...

[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector<byte> GetVector(byte vectorByte)
{
#if !netcoreapp

Isn't !netcoreapp === netstandard1.0? In which case, why do we need to define netcoreapp as a constant?

Member Author

netstandard1.1+ can also be full framework and mono

Member Author

The fix is currently only in netcoreapp jit 1.2.0+

0x03ul << 32 |
0x02ul << 40 |
0x01ul << 48) + 1;
#endif

This contains nested #if branches where the parent covers a large body of the code. I think we should break it up to surround individual methods instead.

That is,

#if !netstandard10
        // Vector sub-search adapted from https://github.com/aspnet/KestrelHttpServer/pull/1138
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static int LocateFirstFoundByte(Vector<byte> match)
        {
              ...
        }
#endif

Would avoid confusion if methods get moved around/etc.

Member Author

!netstandard10 covers whether the vectors package is available
!netcoreapp covers a bug in the jit regarding vectors that has been fixed in netcoreapp 1.2; but hasn't been released in full framework; so it is a subset of !netstandard10
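The layering of the two symbols described above could be sketched roughly as follows. The symbols `netstandard10` and `netcoreapp` come from the thread; the class `BuildFlagsSketch`, the method `Describe`, and the returned strings are hypothetical and only label which branch was compiled in.

```csharp
public static class BuildFlagsSketch
{
    // Hypothetical helper: returns a label for the branch selected at compile time.
    public static string Describe()
    {
#if netstandard10
        // Vectors package not available: only the sequential search can be compiled.
        return "sequential only";
#elif netcoreapp
        // Vectors available and the JIT fix is present.
        return "vectors, no JIT workaround";
#else
        // Vectors available (e.g. full framework), but the JIT fix has not shipped,
        // so the workaround path is compiled instead.
        return "vectors, JIT workaround";
#endif
    }
}
```

Because the `netcoreapp` distinction only matters when vectors are available at all, it sits inside the `!netstandard10` case, which is why it is described as a subset.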

Member Author

#if'd the individual methods where applicable

@ahsonkhan

LGTM

@ahsonkhan

@KrzysztofCwalina, can this be merged?

@KrzysztofCwalina
Member

Yes. Looks good to me.

@davidfowl
Member

davidfowl commented Mar 16, 2017

Is this only for fast span?

@jkotas
Member

jkotas commented Mar 16, 2017

Is this only for fast span?

It is for both.

@davidfowl
Member

Ok good.

@ahsonkhan

Is this only for fast span?

Extension methods in the SpanExtensions class only have one implementation (which lives in corefx).

@benaadams
Member Author

Still think Clear and Fill should move to Extensions rather than instance methods defined on both spans

@ahsonkhan

I would love to see the numbers on the impact of vectorization for small lengths and whether it would make sense to do a few one-off checks before taking the vectorized path to get a more balanced performance profile.

Am guessing it will make initial start better, but then the following worse.
E.g. if the match was at byte 9 and the startup checked the first 8 bytes and then went vector, the byte 9 match would pay the cost of both the 8-byte individual search and the vector startup.
Also for something like "StartsWith" you check the first byte and subsequent rather than doing an index of?
Will try though...

Hey @benaadams, will you be providing the requested perf numbers? If you are too busy elsewhere, I am more than happy to run some benchmark tests and provide the requested data. Let me know :)

@KrzysztofCwalina
Member

@benaadams, I think you are right about the cost of the 9th item, but here is what I am thinking:

  1. The cost for the first item is very low in comparison to setting up vectors, and so the vectorized implementation makes it 10x worse.
  2. The cost of searching the 1001st item is high, and so setting up a vector after the 1000th item would be a smaller percentage hit.
  3. Somewhere in between there might be a threshold where the hit is negligible, yet after the threshold the wins are high.

.. just a thought.
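The blend being discussed (a short sequential prefix, then the vector loop) might look something like the sketch below. This is not the PR's implementation: `BlendedIndexOfSketch` and the `SequentialPrefix` value of 8 are hypothetical, chosen only for illustration, and it works on a plain `byte[]` rather than a span. It uses `Vector<byte>`, `Vector.EqualsAny`, and `Vector.IsHardwareAccelerated` from `System.Numerics`.

```csharp
using System;
using System.Numerics;

public static class BlendedIndexOfSketch
{
    // Hypothetical threshold; the thread does not settle on a value.
    private const int SequentialPrefix = 8;

    public static int IndexOf(byte[] data, byte value)
    {
        int i = 0;

        // 1. Check the first few bytes one at a time so that early matches
        //    never pay the vector setup cost.
        int prefixEnd = Math.Min(SequentialPrefix, data.Length);
        for (; i < prefixEnd; i++)
        {
            if (data[i] == value) return i;
        }

        // 2. Scan whole vector-sized blocks while enough bytes remain.
        if (Vector.IsHardwareAccelerated)
        {
            var needle = new Vector<byte>(value);
            int lastVectorStart = data.Length - Vector<byte>.Count;
            for (; i <= lastVectorStart; i += Vector<byte>.Count)
            {
                var block = new Vector<byte>(data, i);
                if (Vector.EqualsAny(block, needle))
                {
                    // A match is somewhere in this block; the loop below
                    // finds its exact position starting from i.
                    break;
                }
            }
        }

        // 3. Finish sequentially: either inside the matching block or over
        //    the tail that is shorter than one vector.
        for (; i < data.Length; i++)
        {
            if (data[i] == value) return i;
        }

        return -1;
    }
}
```

With this shape, a match at byte 9 still pays for both the 8-byte prefix and the vector setup, which is the trade-off raised earlier in the thread; where to put the threshold is what the benchmark numbers would have to decide.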

@benaadams
Member Author

K, so... I've had a d'oh moment.

I've not been able to check as I have been having some issues with the build (now resolved) and have not been able to build and use coreclr+corefx from src.

However... as they are extension methods I don't actually need to build coreclr+corefx to test 😝

@benaadams
Member Author

On netcoreapp 1.1 (netcoreapp 2.0 jit has had Vector fixes in this area, but not sure how to benchmark it other than in a frankenbuild, will try in morning)

Vector16

            Method | Position |       Mean |    StdErr |           Op/s | Scaled |
------------------ |--------- |----------- |---------- |--------------- |------- |
 IndexOfSequential |        0 |  8.2329 ns | 0.0172 ns | 121,463,880.98 |   1.00 |
     IndexOfVector |        0 | 12.1565 ns | 0.0303 ns |  82,260,404.53 |   1.48 |
 IndexOfSequential |        1 |  8.4898 ns | 0.0223 ns | 117,788,688.66 |   1.00 |
     IndexOfVector |        1 | 12.2424 ns | 0.0247 ns |  81,683,053.45 |   1.44 |
 IndexOfSequential |        2 |  8.4635 ns | 0.0186 ns | 118,153,863.70 |   1.00 |
     IndexOfVector |        2 | 12.1465 ns | 0.0315 ns |  82,328,349.65 |   1.44 |
 IndexOfSequential |        3 |  8.7908 ns | 0.0222 ns | 113,755,761.89 |   1.00 |
     IndexOfVector |        3 | 12.2335 ns | 0.0271 ns |  81,742,869.64 |   1.39 |
 IndexOfSequential |        4 |  8.8272 ns | 0.0216 ns | 113,286,658.97 |   1.00 |
     IndexOfVector |        4 | 12.1471 ns | 0.0318 ns |  82,324,441.81 |   1.38 |
 IndexOfSequential |        5 |  9.5488 ns | 0.0219 ns | 104,725,439.75 |   1.00 |
     IndexOfVector |        5 | 14.0967 ns | 0.5159 ns |  70,938,373.83 |   1.48 |
 IndexOfSequential |        6 |  9.1586 ns | 0.0195 ns | 109,186,806.32 |   1.00 |
     IndexOfVector |        6 | 12.2056 ns | 0.0298 ns |  81,929,593.83 |   1.33 |
 IndexOfSequential |        7 |  9.0639 ns | 0.0297 ns | 110,327,783.91 |   1.00 |
     IndexOfVector |        7 | 12.1693 ns | 0.0330 ns |  82,174,250.94 |   1.34 |
 IndexOfSequential |        8 | 10.0264 ns | 0.0187 ns |  99,736,818.44 |   1.00 |
     IndexOfVector |        8 | 13.1485 ns | 0.0219 ns |  76,054,396.00 |   1.31 |
 IndexOfSequential |        9 | 10.3208 ns | 0.0275 ns |  96,891,576.60 |   1.00 |
     IndexOfVector |        9 | 13.1747 ns | 0.0242 ns |  75,903,337.87 |   1.28 |
 IndexOfSequential |       10 | 10.4098 ns | 0.0252 ns |  96,063,745.12 |   1.00 |
     IndexOfVector |       10 | 13.1398 ns | 0.0304 ns |  76,104,904.59 |   1.26 |
 IndexOfSequential |       11 | 10.8022 ns | 0.0208 ns |  92,574,062.49 |   1.00 |
     IndexOfVector |       11 | 13.2580 ns | 0.0417 ns |  75,425,977.24 |   1.23 |
 IndexOfSequential |       12 | 10.8137 ns | 0.0310 ns |  92,475,575.28 |   1.00 |
     IndexOfVector |       12 | 13.1738 ns | 0.0201 ns |  75,908,092.74 |   1.22 |
 IndexOfSequential |       13 | 11.6595 ns | 0.0248 ns |  85,766,675.02 |   1.00 |
     IndexOfVector |       13 | 13.1260 ns | 0.0277 ns |  76,184,748.47 |   1.13 |
 IndexOfSequential |       14 | 11.3840 ns | 0.0231 ns |  87,842,860.49 |   1.00 |
     IndexOfVector |       14 | 13.1927 ns | 0.0252 ns |  75,799,701.37 |   1.16 |
 IndexOfSequential |       15 | 11.0706 ns | 0.0284 ns |  90,329,019.67 |   1.00 |
     IndexOfVector |       15 | 13.1952 ns | 0.0304 ns |  75,785,301.56 |   1.19 |
 IndexOfSequential |       16 | 12.2549 ns | 0.0167 ns |  81,599,913.09 |   1.00 |
     IndexOfVector |       16 | 14.3018 ns | 0.0421 ns |  69,921,338.19 |   1.17 |
 IndexOfSequential |       17 | 12.3714 ns | 0.0258 ns |  80,831,372.78 |   1.00 |
     IndexOfVector |       17 | 14.2834 ns | 0.0338 ns |  70,011,346.70 |   1.15 |
 IndexOfSequential |       18 | 12.6749 ns | 0.0172 ns |  78,896,200.37 |   1.00 |
     IndexOfVector |       18 | 14.3005 ns | 0.0355 ns |  69,927,826.82 |   1.13 |
 IndexOfSequential |       30 | 15.6761 ns | 0.0302 ns |  63,791,577.30 |   1.00 |
     IndexOfVector |       30 | 15.3921 ns | 0.0392 ns |  64,968,417.78 |   0.98 |
 IndexOfSequential |       31 | 15.7588 ns | 0.0281 ns |  63,456,710.98 |   1.00 |
     IndexOfVector |       31 | 15.3798 ns | 0.0383 ns |  65,020,525.42 |   0.98 |
 IndexOfSequential |       32 | 17.4038 ns | 0.0228 ns |  57,458,857.12 |   1.00 |
     IndexOfVector |       32 | 16.2549 ns | 0.0295 ns |  61,519,921.89 |   0.93 |
 IndexOfSequential |       62 | 26.2066 ns | 0.0529 ns |  38,158,348.97 |   1.00 |
     IndexOfVector |       62 | 19.3232 ns | 0.0713 ns |  51,751,279.40 |   0.74 |
 IndexOfSequential |       63 | 26.2554 ns | 0.0598 ns |  38,087,404.06 |   1.00 |
     IndexOfVector |       63 | 18.9159 ns | 0.0335 ns |  52,865,591.94 |   0.72 |
 IndexOfSequential |       64 | 27.9149 ns | 0.0461 ns |  35,823,106.86 |   1.00 |
     IndexOfVector |       64 | 19.1125 ns | 0.0348 ns |  52,321,744.45 |   0.68 |
 IndexOfSequential |      126 | 47.6784 ns | 0.1215 ns |  20,973,863.11 |   1.00 |
     IndexOfVector |      126 | 25.8878 ns | 0.0584 ns |  38,628,299.47 |   0.54 |
 IndexOfSequential |      127 | 47.1735 ns | 0.1182 ns |  21,198,339.82 |   1.00 |
     IndexOfVector |      127 | 25.7767 ns | 0.0587 ns |  38,794,800.07 |   0.55 |
 IndexOfSequential |      128 | 48.8673 ns | 0.0959 ns |  20,463,571.07 |   1.00 |
     IndexOfVector |      128 | 27.0578 ns | 0.0750 ns |  36,957,885.78 |   0.55 |
 IndexOfSequential |      254 | 90.2877 ns | 0.2214 ns |  11,075,705.24 |   1.00 |
     IndexOfVector |      254 | 38.2056 ns | 0.0590 ns |  26,174,147.64 |   0.42 |
 IndexOfSequential |      255 | 91.2492 ns | 0.2278 ns |  10,958,995.40 |   1.00 |
     IndexOfVector |      255 | 36.1968 ns | 0.0387 ns |  27,626,725.27 |   0.40 |

Vector32

            Method | Position |       Mean |    StdDev |           Op/s | Scaled |
------------------ |--------- |----------- |---------- |--------------- |------- |
 IndexOfSequential |        0 |  5.5643 ns | 0.1039 ns | 179,716,020.96 |   1.00 |
     IndexOfVector |        0 | 18.0506 ns | 0.2805 ns |  55,399,722.65 |   3.25 |
 IndexOfSequential |        1 |  5.8655 ns | 0.1158 ns | 170,487,202.10 |   1.00 |
     IndexOfVector |        1 | 18.0863 ns | 0.2738 ns |  55,290,463.47 |   3.08 |
 IndexOfSequential |        2 |  5.9015 ns | 0.1221 ns | 169,447,604.97 |   1.00 |
     IndexOfVector |        2 | 17.9191 ns | 0.3297 ns |  55,806,481.54 |   3.04 |
 IndexOfSequential |        3 |  6.1491 ns | 0.1102 ns | 162,626,689.47 |   1.00 |
     IndexOfVector |        3 | 18.1680 ns | 0.3119 ns |  55,041,840.25 |   2.96 |
 IndexOfSequential |        4 |  6.4681 ns | 0.0992 ns | 154,605,617.25 |   1.00 |
     IndexOfVector |        4 | 18.1983 ns | 0.2913 ns |  54,950,182.40 |   2.81 |
 IndexOfSequential |        5 |  6.8263 ns | 0.1079 ns | 146,491,778.00 |   1.00 |
     IndexOfVector |        5 | 18.1523 ns | 0.3089 ns |  55,089,511.56 |   2.66 |
 IndexOfSequential |        6 |  6.7515 ns | 0.1110 ns | 148,116,241.20 |   1.00 |
     IndexOfVector |        6 | 18.0094 ns | 0.3142 ns |  55,526,558.20 |   2.67 |
 IndexOfSequential |        7 |  7.0542 ns | 0.0902 ns | 141,759,941.11 |   1.00 |
     IndexOfVector |        7 | 18.1601 ns | 0.2567 ns |  55,065,677.88 |   2.57 |
 IndexOfSequential |        8 |  7.6416 ns | 0.1609 ns | 130,863,130.12 |   1.00 |
     IndexOfVector |        8 | 18.6604 ns | 0.3545 ns |  53,589,288.93 |   2.44 |
 IndexOfSequential |        9 |  7.6666 ns | 0.1586 ns | 130,435,359.24 |   1.00 |
     IndexOfVector |        9 | 18.6551 ns | 0.3315 ns |  53,604,748.59 |   2.43 |
 IndexOfSequential |       10 |  8.0531 ns | 0.1250 ns | 124,176,104.69 |   1.00 |
     IndexOfVector |       10 | 18.8463 ns | 0.2125 ns |  53,060,782.72 |   2.34 |
 IndexOfSequential |       11 |  8.3141 ns | 0.0487 ns | 120,277,958.71 |   1.00 |
     IndexOfVector |       11 | 18.7984 ns | 0.2863 ns |  53,195,978.29 |   2.26 |
 IndexOfSequential |       12 |  8.5408 ns | 0.1539 ns | 117,084,611.14 |   1.00 |
     IndexOfVector |       12 | 18.8003 ns | 0.4093 ns |  53,190,513.06 |   2.20 |
 IndexOfSequential |       13 |  8.7511 ns | 0.1508 ns | 114,271,043.31 |   1.00 |
     IndexOfVector |       13 | 18.8665 ns | 0.2230 ns |  53,004,086.98 |   2.16 |
 IndexOfSequential |       14 |  8.9247 ns | 0.1290 ns | 112,048,052.45 |   1.00 |
     IndexOfVector |       14 | 18.8075 ns | 0.2621 ns |  53,170,205.86 |   2.11 |
 IndexOfSequential |       15 |  9.0995 ns | 0.1726 ns | 109,896,138.42 |   1.00 |
     IndexOfVector |       15 | 18.9161 ns | 0.3071 ns |  52,865,057.91 |   2.08 |
 IndexOfSequential |       16 |  9.3366 ns | 0.1837 ns | 107,105,766.78 |   1.00 |
     IndexOfVector |       16 | 19.6176 ns | 0.3283 ns |  50,974,515.16 |   2.10 |
 IndexOfSequential |       17 |  9.6791 ns | 0.1331 ns | 103,314,885.99 |   1.00 |
     IndexOfVector |       17 | 19.5913 ns | 0.3656 ns |  51,043,073.14 |   2.02 |
 IndexOfSequential |       18 |  9.9104 ns | 0.1762 ns | 100,904,582.08 |   1.00 |
     IndexOfVector |       18 | 19.6433 ns | 0.3302 ns |  50,907,846.41 |   1.98 |
 IndexOfSequential |       30 | 12.6302 ns | 0.2186 ns |  79,175,302.76 |   1.00 |
     IndexOfVector |       30 | 19.8281 ns | 0.3045 ns |  50,433,412.68 |   1.57 |
 IndexOfSequential |       31 | 12.8227 ns | 0.2251 ns |  77,986,495.35 |   1.00 |
     IndexOfVector |       31 | 19.7600 ns | 0.3225 ns |  50,607,173.84 |   1.54 |
 IndexOfSequential |       32 | 13.4547 ns | 0.2072 ns |  74,323,445.45 |   1.00 |
     IndexOfVector |       32 | 19.8120 ns | 0.3866 ns |  50,474,572.22 |   1.47 |
 IndexOfSequential |       62 | 20.3955 ns | 0.3955 ns |  49,030,351.90 |   1.00 |
     IndexOfVector |       62 | 21.2280 ns | 0.3216 ns |  47,107,548.66 |   1.04 |
 IndexOfSequential |       63 | 20.7587 ns | 0.2919 ns |  48,172,539.00 |   1.00 |
     IndexOfVector |       63 | 21.1920 ns | 0.3146 ns |  47,187,637.82 |   1.02 |
 IndexOfSequential |       64 | 21.0991 ns | 0.4755 ns |  47,395,320.82 |   1.00 |
     IndexOfVector |       64 | 21.6320 ns | 0.3180 ns |  46,227,833.27 |   1.03 |
 IndexOfSequential |      126 | 35.7981 ns | 0.5825 ns |  27,934,450.17 |   1.00 |
     IndexOfVector |      126 | 24.7291 ns | 0.3311 ns |  40,438,168.46 |   0.69 |
 IndexOfSequential |      127 | 36.0328 ns | 0.5868 ns |  27,752,478.02 |   1.00 |
     IndexOfVector |      127 | 24.5912 ns | 0.4800 ns |  40,664,894.33 |   0.68 |
 IndexOfSequential |      128 | 36.8935 ns | 0.3817 ns |  27,105,063.64 |   1.00 |
     IndexOfVector |      128 | 24.9595 ns | 0.3333 ns |  40,064,966.02 |   0.68 |
 IndexOfSequential |      254 | 67.9514 ns | 0.9889 ns |  14,716,408.52 |   1.00 |
     IndexOfVector |      254 | 32.5193 ns | 0.6350 ns |  30,750,958.82 |   0.48 |
 IndexOfSequential |      255 | 68.5728 ns | 0.7087 ns |  14,583,033.91 |   1.00 |
     IndexOfVector |      255 | 29.6946 ns | 0.5481 ns |  33,676,109.85 |   0.43 |

Looking at the results; it probably does need some kind of blend.

Also will try to use Unsafe.AsPointer and bytewise start to align the start of the vector search and see what that does

@benaadams
Member Author

For netcoreapp 1.1 Vector cut overs are about 2x Vector.Count (e.g. 32 and 64)
Vector16 is at worst x1.5; Vector32 is ~x3

@benaadams
Member Author

Have an idea...

@benaadams
Member Author

Is it worth merging this now and then following up with the startup change? i.e. IndexOfVectorized has gone from corefxlab and SequentialIndexOf is available; which is causing an issue for Kestrel updating corefxlab packages aspnet/KestrelHttpServer#1509

/cc @davidfowl

@benaadams
Member Author

There is some bad codegen in the Vector path; trying to build with coreclr 2.0 to see if it's fixed

@KrzysztofCwalina
Member

Yeah, I think we should merge and then optimize the front. @davidfowl? @ahsonkhan?

@benaadams
Member Author

Having trouble getting the asm for 2.0 but there is a bunch of code gen it hits that should have been improved: e.g. dotnet/coreclr#7407, https://github.com/dotnet/coreclr/issues/7843

@ahsonkhan

Can we merge? @jkotas

@jkotas
Member

jkotas commented Mar 21, 2017

LGTM

@jkotas jkotas merged commit 5775579 into dotnet:master Mar 21, 2017
@karelz karelz modified the milestone: 2.0.0 Mar 25, 2017
@benaadams benaadams deleted the span-indexof branch September 19, 2018 18:20

8 participants