Issue #17118 Loop unrolling in Span.CopyTo slow path#18435
Conversation
|
This is performance fix. Could you please measure some performance numbers before and after for a few interesting cases - to verify that it is indeed making the code faster?
You need to measure the performance impact of such change. It is quite possible that this will make it slower than the trivial loop. |
|
@jkotas Yup I will do it. Hmmm I understand it could impact performance - I was caught in dilemma - potential perf loss on one side and, code duplication and compile time size on the other... I think of following combinations for performance testing,
For the perf testing, would specifying just the framework, as in case of build and testing, be sufficient? I think I will require some time to gather this :) In few hours from now, in the morning I've to go out for weekend. Hope its fine if I continue Monday onwards... Thanks! |
I would think 3-5x would be enough to see if there is a regression or improvement. |
95e5b7e to
1f64caf
Compare
|
@WinCPP Any results for how this impacts performance? |
|
Hi @shiftylogic I'll be working on this. I got two other milestone 2.0 that I'm contributing to on priority. Can we hold off on this for a while. I want to work on this too... just that this being for 'future' milestone, I took liberty to give it a lower priority... Hope it is fine... |
|
I have started working on this to gather performance data. I have been getting errors with performance testing framework, but I think it is time I got the sandbox is shape. I have entire setup as per the 'performance testing' document in the repo wiki. On giving msbuild command, the performance tests run. But at the end it fails giving this descriptive message, I have run repair for VS 2015 community edition. I don't know what more is required to have measurement.py. Is there some way by which I can force download the missing tools, if that is the case here? @shiftylogic @jkotas @stephentoub @karelz Kindly advise. |
|
@DrewScoggins @mellinoe can you please help troubleshoot the perf infra failures? |
|
I did a clean build of corefx and tried to run the performance tests (following this):
I get the following errors: Build FAILED. |
|
@ahsonkhan Have you done the "source build" (build.cmd) in release mode first? That error seems like the one you'd get if you didn't. |
No, I hadn't. I ran I get this error now (and the testResults.xml doesn't exist in the bin directory): Finished running tests. End time=18:59:36.11, Exit code = -2147450751 Build FAILED. |
|
@mellinoe ... about the issue that I'm facing. The folder "M:\corefx\Tools/Microsoft.BenchView.JSONFormat" itself doesn't exist. Looks like it didn't get downloaded from the build repo? Is there someway to force download the missing tools? because simple 'clean.cmd' followed by 'build.cmd' doesn't seem to be pulling it down... |
|
I have a PR out to fix this issue. The main problem is that we did not have the calls to the tools that we use to upload the data to our results service completely hidden by the logging flag. This change should fix that. In the meantime you can get the tooling by using the command I pasted below. This will unblock you. You should replace %WORKSPACE% with the root of the CoreFX repo. Also of course ensure that you have a copy of nuget.exe.
|
|
@DrewScoggins awesome! It solved my problem. I'm set to design and execute the perf tests.... Thanks! Between, I just wanted to point out that the performance testing steps document (here) mentions "...run from the tests directory." [This is first line in second paragraph for Windows section under "Running the tests" section.] Actually when run in the 'tests' directory it gives 'PerfRunner.exe' does not exist... I think the line should mention running the msbuild command in "tests\Performance" directory. Only then did I get the expected output. Thanks! |
|
@jkotas @shiftylogic Need help with the command for running performance tests for the slow path. Following is the command that I'm running in the src\System.Memory\tests\performance directory. This is mix of the command mentioned on performance test help page (here) and @jkotas 's comment on the issue page (here) about how to invoke slow path i.e. netfx framework... Kindly let me know if the above command makes sense. I do not see exeisting performance tests being executed when I issue the command. However if I use the command on the performance test page (for Windows_NT OS), it runs and dumps statistics on the screen. Kindly help. |
|
@shiftylogic @jkotas @ahsonkhan @karelz @mellinoe Hope I am not causing inconvenience by this thread. But the perf test configuration for TargetGroup netfx is not going smooth. From the 'developer guide' and 'project guidelines' documents, I figured out that I need to additionally mention With that the performance loop was triggered but I got a new error related to I am now trying to figure out how to get The console output of interest, as I mentioned above, is given below. Line of interest (dotnet.exe) is 4th from top. |
|
@ahsonkhan @shiftylogic @KrzysztofCwalina can you please help with guidance how you guys do perf comparisons? @DrewScoggins can you please try to get @WinCPP unblocked? @WinCPP if you need to compare couple of tests, try using BenchmarkDotNet for one off perf test measurements ... to get yourself unblocked. |
|
I tried collecting data by changing the existing normal test to print performance data. (I wanted to finish this round of perf data today... hence the short cut...) The data is towards end of the comment. Meaning of various keywords in the tables are as follows,
Looks like the One Loop implementation that I had pushed previous doesn't have any more benefits, in fact is worse in some case. So I will replace that with Two Loops implementation. Between Existing and Two Loops implementation, I am not able to make a call. Latter appears to show better performance for span of |
|
@WinCPP can you put your modifications / experiments into a gist or somewhere publicly accessible? It's super-useful when people want to double-check your changes or when they have idea they want to measure & build on top of your changes ... |
|
+1 Could you please share the exact source of the test? In particular, I would like to know the block size that you are using for the test. |
|
Kindly refer to the following commit in another branch in my fork for the source. It has different CopyTo versions that I used for testing with renaming and also the test wrapper. I have added comments there to explain what I was trying to do. Thanks! Commit link: WinCPP@8881ede |
You should measure different blockSizes. 20M block size won't fit into the cache, and so the micro benchmark will be likely dominated by the memory latency. It may explain why you are not seeing much difference between different variants of the code. |
|
It looks like for actually collecting performance numbers we are kind of unblocked, by what @WinCPP did. As for the actual problem at hand we have never tested or done any work to make performance tests run on any configuration other than the default, netcoreapp. It certainly would be possible to do the work to make the tests work under this configuration, but I am not sure what the benefits would be nor do I have a good idea right now of that amount of work involved. |
|
@DrewScoggins the benefits are obvious (at least to me). I thought this was always part of the work as well. Let's chat to poke more at the gaps of expectations here ... |
|
I took his code change and ran some performance numbers on it. Below are the results. It appears that the unrolled version gets us ~15-20% (give or take with noise) for non-trivial sized buffers. NOTE: Ignore the "Fast?" column. It only makes sense if I run the tests that include taking the fast path through (copy block).
|
|
Sorry guys, got held up at work... @shiftylogic thanks for looking into it. Just asking, is the data using the version of "Two Loops" version of the |
|
FYI: We discussed with @DrewScoggins the need to have ability to run perf tests against Desktop / current NuGet packages targeting Desktop.
|
|
@WinCPP No, these numbers are the one loop variant. I didn't test the two loop variant. |
|
If you provide me the snippet of code that does the two-loop variant, I can run those numbers quickly and compare. Either way, you also need to fix the commit conflict due to the bug fix for overlap detection that was merged this morning. Shouldn't impact your actual change. |
|
Ah! So two loop variant needs data to be generated...? @shiftylogic is your framework shared somewhere that I could use it? My test app is too ad hoc and requires a lot of manual data collation... |
|
@shiftylogic oops our replies crossed each other, i just noticed. the other loop is here... (link) So based on the outputs and recommendation from you and @jkotas I will pick up the approved implementation and check it in with merge conflict resolution... |
|
It appears that the "two loop" variant of this change results in no performance gain at all. I'm digging into why this is, but the JIT generates better code for the "one loop" variant. I'm talking to the JIT team about what is causing this. For now, can you please resolve the conflicts for the "one loop" variant and we can take this change. It gives us a decent perf bump. |
|
@shiftylogic I have resolved the conflict and the builds have passed... Between, if it is not off-limits for me (IPR, etc.), would you mind sharing the gist of your discussion with the JIT team about the code generated for "one / two loop" variants... I would love to read. Thanks! |
|
The one-loop variant has a bit of extra math (via the direction variable) that causes the JIT to do CSE on the subexpressions (runCount + direction * n) into a temp which resulted in slightly better code generation. The JIT decided that CSE wasn't necessary in the two-loop variant and thus resulted in many extra instructions being generated. |
|
@ahsonkhan if this is approved should you merge? |
|
@shiftylogic thanks for sharing! I'm sure intricacies must be lot interesting... :) Thanks! |
|
FYI: Here's the tracking issue #19200 for:
|
Fixes #17118.
@shiftylogic @jkotas Kindy review the code change. I have merged the forward and reverse paths using a
directionvariable. I am not sure if this would affect vectorization avenues if that was the intention to have unrolled loops. Kindly advise. If yes, then I think the option will be to write separate blocks for forward and reverse copy. Thanks.