This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Issue #17118 Loop unrolling in Span.CopyTo slow path #18435

Merged
ahsonkhan merged 2 commits into dotnet:master from WinCPP:Issue-17118-2
Apr 27, 2017

Conversation

@WinCPP

@WinCPP WinCPP commented Apr 15, 2017

Fixes #17118.

@shiftylogic @jkotas Kindly review the code change. I have merged the forward and reverse paths using a direction variable. I am not sure whether this affects vectorization opportunities, if enabling vectorization was the intention behind the unrolled loops. Kindly advise. If so, I think the option will be to write separate blocks for the forward and reverse copy. Thanks.
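For readers skimming the thread, the merged loop described above can be pictured with a small C++ sketch (illustrative only — the actual PR is C#, and the names here are made up): a single loop walks forward or backward depending on whether a forward copy would overwrite source elements before reading them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a merged forward/reverse copy loop.
// direction = +1 walks 0..count-1; direction = -1 walks count-1..0.
template <typename T>
void copy_one_loop(T* dst, const T* src, std::size_t count) {
    // Forward is safe when the destination starts at or before the source;
    // otherwise copy backward so overlapping elements are read first.
    const std::ptrdiff_t direction = (dst <= src) ? 1 : -1;
    std::ptrdiff_t i =
        (direction == 1) ? 0 : static_cast<std::ptrdiff_t>(count) - 1;
    for (std::size_t n = 0; n < count; ++n, i += direction)
        dst[i] = src[i];
}
```

The question in the comment is whether folding both directions into one loop like this defeats the optimizations (unrolling, vectorization) that two specialized loops would allow.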

@jkotas
Member

jkotas commented Apr 15, 2017

This is a performance fix. Could you please measure some performance numbers before and after for a few interesting cases, to verify that it is indeed making the code faster?

I have merged the forward and reverse paths

You need to measure the performance impact of such a change. It is quite possible that this will make it slower than the trivial loop.

@WinCPP
Author

WinCPP commented Apr 15, 2017

@jkotas Yup, I will do it. Hmmm, I understand it could impact performance. I was caught in a dilemma: potential perf loss on one side, and code duplication and compiled code size on the other...

I am thinking of the following combinations for performance testing:

  1. Different spans: of int, of Guid, and of class object references.
  2. Different span sizes: 2000 and 20000.
  3. Three different algorithms: previous, new merged, and new split.
  4. Each of the above combinations to be run at least 10 times (?)

For the perf testing, would specifying just the framework, as in the case of build and test, be sufficient?

I think I will need some time to gather this :) In a few hours from now, in the morning, I have to go out for the weekend. Hope it's fine if I continue from Monday onwards... Thanks!

@ahsonkhan

Each of above combinations to be run at least 10 times (?)

I would think 3-5x would be enough to see if there is a regression or improvement.

@WinCPP WinCPP force-pushed the Issue-17118-2 branch 2 times, most recently from 95e5b7e to 1f64caf on April 18, 2017 02:44
@shiftylogic
Contributor

@WinCPP Any results for how this impacts performance?

@WinCPP
Author

WinCPP commented Apr 19, 2017

Hi @shiftylogic, I'll be working on this. I have two other milestone 2.0 items that I'm contributing to on priority. Can we hold off on this for a while? I want to work on this too... it's just that, this being for the 'Future' milestone, I took the liberty of giving it a lower priority... Hope that is fine...

@WinCPP
Author

WinCPP commented Apr 20, 2017

I have started working on this to gather performance data. I have been getting errors with the performance testing framework, but I think it is time I got the sandbox in shape.

I have the entire setup as per the 'performance testing' document in the repo wiki. On issuing the msbuild command, the performance tests run, but at the end they fail with this descriptive message:

  [4/21/2017 1:15:06 AM][INF] Statistics written to "M:\corefx\bin\Windows_NT.AnyCPU.Release\System.Memory.Performance.Tests\netcoreapp\Perf-System.Memory.Performance.Tests.csv"
  E:\Program Files (x86)\Python36-32\python.exe: can't open file 'M:\corefx\Tools/Microsoft.BenchView.JSONFormat\tools\measurement.py': [Errno 2] No such file or directory
  Finished running tests.  End time= 1:15:07.83, Exit code = 2

I have run a repair of the VS 2015 Community edition. I don't know what more is required to have measurement.py. Is there some way I can force-download the missing tools, if that is the case here?

@shiftylogic @jkotas @stephentoub @karelz Kindly advise.

@karelz
Member

karelz commented Apr 21, 2017

@DrewScoggins @mellinoe can you please help troubleshoot the perf infra failures?

@ahsonkhan

I did a clean build of corefx and tried to run the performance tests (following this):

D:\GitHub\Fork\corefx\src\System.Memory\tests>msbuild /t:BuildAndTest /p:Performance=true /p:ConfigurationGroup=Release /p:TargetOS=Windows_NT

I get the following errors:
CSC : error CS0006: Metadata file 'D:\GitHub\Fork\corefx\bin/runtime/netcoreapp-Windows_NT-Release-x64/xunit.core.dll' could not be found [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
CSC : error CS0006: Metadata file 'D:\GitHub\Fork\corefx\bin/runtime/netcoreapp-Windows_NT-Release-x64/Xunit.NetCore.Extensions.dll' could not be found [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
CSC : error CS0006: Metadata file 'D:\GitHub\Fork\corefx\bin/runtime/netcoreapp-Windows_NT-Release-x64/xunit.assert.dll' could not be found [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
CSC : error CS0006: Metadata file 'D:\GitHub\Fork\corefx\bin/runtime/netcoreapp-Windows_NT-Release-x64/xunit.abstractions.dll' could not be found [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
CSC : error CS0006: Metadata file 'D:\GitHub\Fork\corefx\bin/runtime/netcoreapp-Windows_NT-Release-x64/xunit.performance.core.dll' could not be found [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
CSC : error CS0006: Metadata file 'D:\GitHub\Fork\corefx\bin/runtime/netcoreapp-Windows_NT-Release-x64/xunit.performance.api.dll' could not be found [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
Done Building Project "D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj" (BuildAndTest target(s)) -- FAILED.

Build FAILED.

@mellinoe
Contributor

@ahsonkhan Have you done the "source build" (build.cmd) in release mode first? That error seems like the one you'd get if you didn't.

@ahsonkhan

ahsonkhan commented Apr 21, 2017

Have you done the "source build" (build.cmd) in release mode first? That error seems like the one you'd get if you didn't.

No, I hadn't. I ran build.cmd -release. That resolved the issue I mentioned above.

I get this error now (and the testResults.xml doesn't exist in the bin directory):
RunTestsForProject:
D:\GitHub\Fork\corefx\bin/AnyOS.AnyCPU.Release/System.Memory.Tests/netcoreapp//RunTests.cmd D:\GitHub\Fork\corefx\bin/testhost/netcoreapp-Windows_NT-Release-x64/
Using D:\GitHub\Fork\corefx\bin\testhost\netcoreapp-Windows_NT-Release-x64\ as the test runtime folder.
Executing in D:\GitHub\Fork\corefx\bin\AnyOS.AnyCPU.Release\System.Memory.Tests\netcoreapp
Running tests... Start time: 18:59:36.10
Command(s):
D:\GitHub\Fork\corefx\bin\testhost\netcoreapp-Windows_NT-Release-x64\dotnet.exe PerfRunner.exe --perf:runid Perf
if exist Perf-System.Memory.Tests.xml (
py D:\GitHub\Fork\corefx\Tools/Microsoft.BenchView.JSONFormat\tools\measurement.py xunit Perf-System.Memory.Tests.xml --better desc --drop-first-value --append -o D:\GitHub\Fork\corefx\measurement.json
)
The application to execute does not exist: 'D:\GitHub\Fork\corefx\bin\AnyOS.AnyCPU.Release\System.Memory.Tests\netcoreapp\PerfRunner.exe'

Finished running tests. End time=18:59:36.11, Exit code = -2147450751
D:\GitHub\Fork\corefx\Tools\tests.targets(326,5): warning MSB3073: The command "D:\GitHub\Fork\corefx\bin/AnyOS.AnyCPU.Release/System.Memory.Tests/netcoreapp//RunTests.cmd D:\GitHub\Fork\corefx\bin/testhost/netcoreapp-Windows_NT-Release
-x64/" exited with code -2147450751. [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
The previous error was converted to a warning because the task was called with ContinueOnError=true.
Build continuing because "ContinueOnError" on the task "Exec" is set to "true".
D:\GitHub\Fork\corefx\Tools\tests.targets(334,5): error : One or more tests failed while running tests from 'System.Memory.Tests' please check D:\GitHub\Fork\corefx\bin/AnyOS.AnyCPU.Release/System.Memory.Tests/netcoreapp/testResults.xml
for details! [D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj]
Done Building Project "D:\GitHub\Fork\corefx\src\System.Memory\tests\System.Memory.Tests.csproj" (BuildAndTest target(s)) -- FAILED.

Build FAILED.

@WinCPP
Author

WinCPP commented Apr 21, 2017

@mellinoe ... about the issue that I'm facing: the folder "M:\corefx\Tools/Microsoft.BenchView.JSONFormat" itself doesn't exist. Looks like it didn't get downloaded from the build repo? Is there some way to force-download the missing tools? A simple 'clean.cmd' followed by 'build.cmd' doesn't seem to be pulling it down...

@DrewScoggins
Member

I have a PR out to fix this issue. The main problem is that the calls to the tools we use to upload data to our results service were not completely hidden behind the logging flag; this change should fix that. In the meantime, you can get the tooling with the command pasted below, which will unblock you. Replace %WORKSPACE% with the root of the corefx repo, and of course ensure that you have a copy of nuget.exe.

C:\Tools\nuget.exe install Microsoft.BenchView.JSONFormat -Source http://benchviewtestfeed.azurewebsites.net/nuget -OutputDirectory "%WORKSPACE%\Tools" -Prerelease -ExcludeVersion

@WinCPP
Author

WinCPP commented Apr 21, 2017

@DrewScoggins awesome! It solved my problem. I'm set to design and execute the perf tests.... Thanks!

By the way, I just wanted to point out that the performance testing steps document (here) mentions "...run from the tests directory." [This is the first line of the second paragraph of the Windows section under "Running the tests".] Actually, when run from the 'tests' directory it gives a 'PerfRunner.exe' does not exist error... I think the line should mention running the msbuild command from the "tests\Performance" directory; only then did I get the expected output. Thanks!

@WinCPP
Author

WinCPP commented Apr 22, 2017

@jkotas @shiftylogic I need help with the command for running performance tests for the slow path. The following is the command that I'm running in the src\System.Memory\tests\Performance directory.

msbuild /t:RebuildAndTest /p:Performance=true /p:ConfigurationGroup=Release /p:TargetGroup=netfx

This is a mix of the command mentioned on the performance test help page (here) and @jkotas's comment on the issue page (here) about how to invoke the slow path, i.e., the netfx framework... Kindly let me know if the above command makes sense. I do not see the existing performance tests being executed when I issue the command. However, if I use the command on the performance test page (for the Windows_NT OS), it runs and dumps statistics to the screen.

Kindly help.

@WinCPP
Author

WinCPP commented Apr 23, 2017

@shiftylogic @jkotas @ahsonkhan @karelz @mellinoe Hope I am not causing inconvenience with this thread, but the perf test configuration for TargetGroup netfx is not going smoothly. From the 'developer guide' and 'project guidelines' documents, I figured out that I need to additionally specify /p:TargetOS=Windows_NT, so the full command for performance testing with the netfx framework should be (I think):

msbuild /t:RebuildAndTest /p:Performance=true /p:ConfigurationGroup=Release /p:TargetGroup=netfx /p:TargetOS=Windows_NT

With that, the performance loop was triggered, but I got a new error related to dotnet.exe not being present in the testhost directory. A snippet of the relevant error lines is towards the end of this comment. In the folder D:\WinCPP\corefx\bin\testhost, the structure of the netfx-Windows_NT-Release-x64 folder is quite different from netcoreapp-Windows_NT-Debug-x64. The latter has dotnet.exe, but the former (netfx) just has a dump of various assemblies, I think from the build.

I am now trying to figure out how to get dotnet.exe into testhost\netfx* folder.

The console output of interest, as mentioned above, is given below. The line of interest (dotnet.exe) is the 4th from the top.

RunTestsForProject:
  D:\WinCPP\corefx\bin/AnyOS.AnyCPU.Release/System.Memory.Performance.Tests/netfx//RunTests.cmd D:\WinCPP\corefx\bin/testhost/netfx-Windows_NT-Release-x64/
  Using D:\WinCPP\corefx\bin\testhost\netfx-Windows_NT-Release-x64\ as the test runtime folder.
  'D:\WinCPP\corefx\bin\testhost\netfx-Windows_NT-Release-x64\\dotnet.exe' is not recognized as an internal or external command,
  operable program or batch file.
  Executing in D:\WinCPP\corefx\bin\AnyOS.AnyCPU.Release\System.Memory.Performance.Tests\netfx\
  Running tests... Start time: 10:44:17.35
  Command(s):
  set DEVPATH=D:\WinCPP\corefx\bin\testhost\netfx-Windows_NT-Release-x64\
  D:\WinCPP\corefx\bin\testhost\netfx-Windows_NT-Release-x64\\dotnet.exe PerfRunner.exe --perf:runid Perf
  if exist Perf-System.Memory.Performance.Tests.xml (
  py D:\WinCPP\corefx\Tools/Microsoft.BenchView.JSONFormat\tools\measurement.py xunit Perf-System.Memory.Performance.Tests.xml --better desc --drop-first-value --append -o D:\WinCPP\corefx\measurement.json
  )
  Finished running tests.  End time=10:44:17.35, Exit code = 9009
D:\WinCPP\corefx\Tools\tests.targets(326,5): warning MSB3073: The command "D:\WinCPP\corefx\bin/AnyOS.AnyCPU.Release/System.Memory.Performance.Tests/netfx//RunTests.cmd D:\WinCPP\corefx\bin/testhost/netfx-Windows_NT-Release-x64/" exited
 with code 9009. [D:\WinCPP\corefx\src\System.Memory\tests\Performance\System.Memory.Performance.Tests.csproj]
  The previous error was converted to a warning because the task was called with ContinueOnError=true.
  Build continuing because "ContinueOnError" on the task "Exec" is set to "true".
D:\WinCPP\corefx\Tools\tests.targets(334,5): error : One or more tests failed while running tests from 'System.Memory.Performance.Tests' please check D:\WinCPP\corefx\bin/AnyOS.AnyCPU.Release/System.Memory.Performance.Tests/netfx/testRe
sults.xml for details! [D:\WinCPP\corefx\src\System.Memory\tests\Performance\System.Memory.Performance.Tests.csproj]
Done Building Project "D:\WinCPP\corefx\src\System.Memory\tests\Performance\System.Memory.Performance.Tests.csproj" (RebuildAndTest target(s)) -- FAILED.

@karelz
Member

karelz commented Apr 23, 2017

@ahsonkhan @shiftylogic @KrzysztofCwalina can you please help with guidance on how you do perf comparisons?

@DrewScoggins can you please try to get @WinCPP unblocked?

@WinCPP if you need to compare a couple of tests, try using BenchmarkDotNet for one-off perf test measurements ... to get yourself unblocked.

@WinCPP
Author

WinCPP commented Apr 23, 2017

I tried collecting data by changing the existing normal test to print performance data. (I wanted to finish this round of perf data today... hence the shortcut...)

The data is towards the end of this comment. The meanings of the various keywords in the tables are as follows:

  • Existing - the current implementation without loop unrolling, which has two separate loops for forward and reverse traversal respectively. Data sets (1) and (4).
  • Two loops - the 'Existing' forward and reverse loops modified with loop unrolling. Data sets (2) and (5).
  • One loop - a combined loop with a direction variable to indicate forward or reverse index access (the current version in this PR; it had a 'mul' operation in the generated IL). Data sets (3) and (6).
  • int vs value type data - each of the above was tested with a span of ints and a value type containing two ints, a long, and a char, with the respective data in columns (a) and (b).
  • Loop direction - forward and reverse direction for the loops that copy data from the source span to the destination span, with the respective data in Table A and Table B. The forward loop is hit when the source span begins after the destination span, and the reverse loop is hit when the source span begins before the destination span.
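As a reference for the variant names above, here is a rough C++ sketch of the forward path of the 'Two loops' variant (the PR itself is C#; names are illustrative, and the reverse loop would be the mirror image, counting down):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of an unrolled forward copy: process four elements per
// iteration, then finish any remainder with a scalar loop.
template <typename T>
void copy_forward_unrolled(T* dst, const T* src, std::size_t count) {
    std::size_t i = 0;
    for (; i + 4 <= count; i += 4) {   // unrolled body
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
    }
    for (; i < count; ++i)             // remainder
        dst[i] = src[i];
}
```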

It looks like the One Loop implementation that I had pushed previously doesn't have any benefit any more; in fact it is worse in some cases. So I will replace that with the Two Loops implementation.

Between the Existing and Two Loops implementations, I am not able to make a call. The latter appears to show better performance for a span of ints (data sets 2a, 5a vs 1a, 4a) but has a negligible effect in the case of value types (data sets 2b, 5b vs 1b, 4b). @jkotas, I'd appreciate your inputs. Thanks!

+--------+-----------------------------------------------------+
|        |             Copy loop direction: Forward            |
|        |             Copy direction: Dest <- Src             |
|        |         Source starts later than destination        |
|        +-----------------+-----------------+-----------------+
|        |   Existing (1)  |  Two Loops (2)  |  One Loop (3)   |
|        +--------+--------+--------+--------+--------+--------+
|        |   (a)  |  (b)   |   (a)  |  (b)   |   (a)  |  (b)   |
|        +--------+--------+--------+--------+--------+--------+
|        | 161    | 486    | 152    | 483    | 162    | 487    |
|        | 148    | 399    | 139    | 394    | 149    | 400    |
|        | 153    | 466    | 144    | 496    | 154    | 519    |
|        | 153    | 455    | 145    | 429    | 155    | 437    |
|        | 154    | 451    | 144    | 445    | 154    | 438    |
|        | 153    | 449    | 144    | 440    | 154    | 447    |
|   T    | 153    | 442    | 144    | 436    | 156    | 479    |
|   A    | 154    | 424    | 144    | 433    | 156    | 521    |
|   B    | 153    | 502    | 144    | 495    | 154    | 435    |
|   L    | 154    | 444    | 146    | 457    | 156    | 438    |
|   E    | 153    | 435    | 144    | 433    | 156    | 436    |
|        | 153    | 437    | 145    | 434    | 155    | 434    |
|   A    | 153    | 446    | 144    | 430    | 154    | 438    |
|        | 154    | 438    | 147    | 420    | 154    | 411    |
|        | 153    | 433    | 144    | 485    | 156    | 434    |
|        | 155    | 435    | 146    | 433    | 155    | 439    |
|        | 153    | 435    | 145    | 426    | 155    | 439    |
|        | 153    | 441    | 145    | 429    | 154    | 435    |
|        | 153    | 477    | 145    | 426    | 155    | 524    |
|        | 152    | 433    | 145    | 522    | 155    | 431    |
+--------+--------+--------+--------+--------+--------+--------+
|        | 153    | 444.32 | 144.42 | 445.42 | 154.58 | 449.21 |
|        |   1.34 |  20.7  |   1.53 |  30.88 |   1.53 |  34.41 |
+--------+--------+--------+--------+--------+--------+--------+

+--------+-----------------------------------------------------+
|        |             Copy loop direction: Reverse            |
|        |             Copy direction: Src -> Dest             |
|        |       Source starts earlier than destination        |
|        +-----------------+-----------------+-----------------+
|        |   Existing (1)  |  Two Loops (2)  |  One Loop (3)   |
|        +--------+--------+--------+--------+--------+--------+
|        |   (a)  |  (b)   |   (a)  |  (b)   |   (a)  |  (b)   |
|        +--------+--------+--------+--------+--------+--------+
|        | 157    | 415    | 147    | 403    | 154    | 402    |
|        | 167    | 410    | 143    | 384    | 150    | 398    |
|        | 164    | 500    | 149    | 520    | 156    | 513    |
|        | 162    | 550    | 150    | 540    | 153    | 434    |
|        | 162    | 448    | 148    | 437    | 153    | 518    |
|        | 163    | 511    | 150    | 457    | 156    | 551    |
|   T    | 162    | 599    | 149    | 430    | 155    | 451    |
|   A    | 163    | 472    | 149    | 458    | 155    | 469    |
|   B    | 163    | 455    | 149    | 442    | 154    | 441    |
|   L    | 162    | 451    | 149    | 467    | 154    | 480    |
|   E    | 164    | 485    | 149    | 476    | 154    | 540    |
|        | 163    | 499    | 150    | 447    | 154    | 451    |
|   B    | 163    | 451    | 149    | 447    | 153    | 448    |
|        | 162    | 454    | 149    | 444    | 154    | 445    |
|        | 162    | 445    | 149    | 428    | 154    | 443    |
|        | 162    | 443    | 149    | 515    | 154    | 444    |
|        | 163    | 511    | 149    | 436    | 154    | 457    |
|        | 162    | 447    | 149    | 441    | 155    | 462    |
|        | 163    | 449    | 149    | 439    | 154    | 438    |
|        | 163    | 445    | 149    | 441    | 154    | 443    |
+--------+--------+--------+--------+--------+--------+--------+
|        | 162.89 | 475    | 148.79 | 455.21 | 154    | 464.53 |
|        |   1.17 |  43.61 |   1.44 |  35.39 |   1.26 |  38.17 |
+--------+--------+--------+--------+--------+--------+--------+

@karelz
Member

karelz commented Apr 23, 2017

@WinCPP can you put your modifications / experiments into a gist or somewhere publicly accessible? It's super-useful when people want to double-check your changes, or when they have an idea they want to measure and build on top of your changes ...

@jkotas
Member

jkotas commented Apr 23, 2017

+1 Could you please share the exact source of the test? In particular, I would like to know the block size that you are using for the test.

WinCPP referenced this pull request in WinCPP/corefx Apr 24, 2017
@WinCPP
Author

WinCPP commented Apr 24, 2017

Kindly refer to the following commit on another branch in my fork for the source. It has the different CopyTo versions (renamed) that I used for testing, and also the test wrapper. I have added comments there to explain what I was trying to do. Thanks!

Commit link: WinCPP@8881ede

@jkotas
Member

jkotas commented Apr 24, 2017

CopyPerfTestWrapperBackward<T>(20000000, iterationCount, timeSpent);

You should measure different block sizes. A 20M block size won't fit into the cache, and so the micro-benchmark will likely be dominated by memory latency. That may explain why you are not seeing much difference between the different variants of the code.

@DrewScoggins
Member

It looks like, for actually collecting performance numbers, we are kind of unblocked by what @WinCPP did. As for the actual problem at hand, we have never tested or done any work to make performance tests run on any configuration other than the default, netcoreapp. It would certainly be possible to do the work to make the tests run under this configuration, but I am not sure what the benefits would be, nor do I have a good idea right now of the amount of work involved.

@karelz
Member

karelz commented Apr 24, 2017

@DrewScoggins the benefits are obvious (at least to me). I thought this was always part of the work as well. Let's chat to poke more at the gaps of expectations here ...

@shiftylogic
Contributor

I took his code change and ran some performance numbers on it. Below are the results.

It appears that the unrolled version gets us ~15-20% (give or take, with noise) for non-trivially sized buffers.

NOTE: Ignore the "Fast?" column. It only makes sense when I run the tests that include taking the fast path (copy block).

Tag            Length    Base     Unrolled  Ratio  Fast?  Forward?
inside         16        14       15        1.07   False  True
overlap front  16        11       11        1.00   False  False
overlap back   16        14       13        0.93   False  True
covers head    16        10       13        1.30   False  False
covers tail    16        12       7         0.58   False  True
inside         256       89       51        0.57   False  True
overlap front  256       62       85        1.37   False  False
overlap back   256       54       44        0.81   False  True
covers head    256       53       46        0.87   False  False
covers tail    256       56       46        0.82   False  True
inside         2048      367      337       0.92   False  True
overlap front  2048      405      316       0.78   False  False
overlap back   2048      376      309       0.82   False  True
covers head    2048      372      316       0.85   False  False
covers tail    2048      376      314       0.84   False  True
inside         4096      753      654       0.87   False  True
overlap front  4096      791      608       0.77   False  False
overlap back   4096      2148     1869      0.87   False  True
covers head    4096      762      606       0.80   False  False
covers tail    4096      742      609       0.82   False  True
inside         16384     2850     2383      0.84   False  True
overlap front  16384     3149     2463      0.78   False  False
overlap back   16384     2884     2399      0.83   False  True
covers head    16384     2837     2294      0.81   False  False
covers tail    16384     2897     2396      0.83   False  True
inside         10485760  1864822  1541150   0.83   False  True
overlap front  10485760  2034064  1531378   0.75   False  False
overlap back   10485760  1963735  1504714   0.77   False  True
covers head    10485760  1837571  1587331   0.86   False  False
covers tail    10485760  1853799  1516896   0.82   False  True

@WinCPP
Author

WinCPP commented Apr 25, 2017

Sorry guys, got held up at work... @shiftylogic thanks for looking into it. Just asking: is this data from the 'Two Loops' version of CopyTo on the other branch...? Based on further instructions, I will check that version into this PR and resolve the conflicts... Thanks!

@karelz
Member

karelz commented Apr 25, 2017

FYI: We discussed with @DrewScoggins the need to have ability to run perf tests against Desktop / current NuGet packages targeting Desktop.
We concluded to:

  1. Put it on the perf team backlog (@DrewScoggins can you please link the issue here when you create it?)
  2. Update docs now saying, it doesn't work yet (@DrewScoggins please link the issue/PR/commit here as well, thanks!)

@shiftylogic
Contributor

@WinCPP No, these numbers are for the one-loop variant. I didn't test the two-loop variant.

@shiftylogic
Contributor

If you provide me with the snippet of code for the two-loop variant, I can run those numbers quickly and compare.

Either way, you also need to fix the merge conflict due to the bug fix for overlap detection that was merged this morning. It shouldn't impact your actual change.

@WinCPP
Author

WinCPP commented Apr 25, 2017

Ah! So the two-loop variant needs data to be generated...? @shiftylogic, is your framework shared somewhere so that I could use it? My test app is too ad hoc and requires a lot of manual data collation...

@WinCPP
Author

WinCPP commented Apr 25, 2017

@shiftylogic oops, our replies crossed each other, I just noticed. The other loop is here... (link)

So based on the outputs and recommendations from you and @jkotas, I will pick up the approved implementation and check it in with the merge conflict resolved...

@shiftylogic
Contributor

It appears that the "two loop" variant of this change results in no performance gain at all. I'm digging into why, but the JIT generates better code for the "one loop" variant. I'm talking to the JIT team about what is causing this.

For now, can you please resolve the conflicts for the "one loop" variant and we can take this change. It gives us a decent perf bump.

@WinCPP
Author

WinCPP commented Apr 26, 2017

@shiftylogic I have resolved the conflict and the builds have passed...

By the way, if it is not off-limits for me (IPR, etc.), would you mind sharing the gist of your discussion with the JIT team about the code generated for the "one / two loop" variants... I would love to read it. Thanks!

@shiftylogic
Contributor

The one-loop variant has a bit of extra math (via the direction variable) that causes the JIT to apply CSE to the subexpressions (runCount + direction * n), hoisting them into a temp, which results in slightly better code generation. The JIT decided that CSE wasn't necessary in the two-loop variant, which resulted in many extra instructions being generated.
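The shape being described can be pictured with a C++ analogue (illustrative only; the actual C# code and the runCount variable are in the PR diff). In the one-loop form (1), the index expression appears on both sides of the assignment, giving the compiler an obvious common subexpression to hoist into a temporary; the two-loop form (2) has no such shared term, even though each loop looks trivial.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// (1) One loop: the subexpression (first + direction * n) occurs twice
// per iteration, inviting CSE into a single temporary.
template <typename T>
void copy_v1(T* dst, const T* src, std::ptrdiff_t count, bool forward) {
    const std::ptrdiff_t direction = forward ? 1 : -1;
    const std::ptrdiff_t first = forward ? 0 : count - 1;
    for (std::ptrdiff_t n = 0; n < count; ++n)
        dst[first + direction * n] = src[first + direction * n];
}

// (2) Two loops: each body is simpler, but there is no shared
// subexpression for the compiler to factor out.
template <typename T>
void copy_v2(T* dst, const T* src, std::ptrdiff_t count, bool forward) {
    if (forward)
        for (std::ptrdiff_t n = 0; n < count; ++n) dst[n] = src[n];
    else
        for (std::ptrdiff_t n = count - 1; n >= 0; --n) dst[n] = src[n];
}
```

The comment's claim is that the .NET JIT of the time compiled shape (1) into tighter code than shape (2), which is the opposite of what one might guess from the source.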

@danmoseley
Member

@ahsonkhan if this is approved should you merge?

@ahsonkhan ahsonkhan merged commit 10afd3f into dotnet:master Apr 27, 2017
@WinCPP
Author

WinCPP commented Apr 27, 2017

@shiftylogic thanks for sharing! I'm sure the intricacies must be a lot more interesting... :) Thanks!

@karelz karelz modified the milestone: 2.0.0 Apr 28, 2017
@karelz
Member

karelz commented May 1, 2017

FYI: Here's the tracking issue #19200 for:

FYI: We discussed with @DrewScoggins the need to have ability to run perf tests against Desktop / current NuGet packages targeting Desktop.
We concluded to:

  1. Put it on the perf team backlog (@DrewScoggins can you please link the issue here when you create it?)

@WinCPP WinCPP deleted the Issue-17118-2 branch September 9, 2019 03:42