Loop unrolled direct string inject#498
Conversation
f993941 to
3643f03
Compare
There was a problem hiding this comment.
I like that this PR removes more code than it adds, but this part now seems slightly more complicated. Are we sure that manual unrolling is faster?
There was a problem hiding this comment.
Yep: #399 (comment)
Non-unrolled (SequentialPointer)
Incrementing value on each unroll step (UnrolledPointer)
Adding to value on each step; and incrementing it at end (UnrolledParallelPointer)
There was a problem hiding this comment.
I remember that now 😄 It's interesting how UnrolledParallelPointer is so consistently better from 1 character all the way to 16384 character strings.
It might be interesting to see how something unaligned like a 33 character string performs, since I don't think any of the test cases hit both loops in GetAsciiStringImplementationB (UnrolledParallelPointer).
There was a problem hiding this comment.
Ah good point, will get that in new tests when doing looking for new numbers.
The ParallelPointer works because the cpu can pipeline the instructions as they are independent of each other (i.e. the effects of one line don't effect the results of the next); incrementing the pointer at each step means the next step can't run until that pointer addition is completed as its dependent on the result. Is also one of the problems with i++ in a regular loop; the next i++ can't be run until that has completed as it is dependent of its result.
There was a problem hiding this comment.
Also If I didn't unroll I think this would be a perf regression as the string is initialized; rather than being operations on initialized stack memory; then a memcopy to an uninitialized string; however I think the code being less weird makes up for that...
There was a problem hiding this comment.
I definitely like the code being less weird. Interesting point about the instruction pipelining. That explains the somewhat unintuitive results. It's amazing the stuff you have to consider at this level of optimization. Someday maybe the compiler will be smart enough to do this kind of thing for us.
I'm going to keep this PR open pending the new numbers.
3643f03 to
ea823bf
Compare
|
Updated with unrolling and improved instruction ordering from #519 |
There was a problem hiding this comment.
There is a further combining change that will this fixed block (as buffers are already fixed) - but this stands alone as is.
1d26e56 to
94fb53e
Compare
|
Thanks for all research/testing you put into this. LGTM. |
94fb53e to
cd502ac
Compare
cd502ac to
ea3e64a
Compare
|
Rebased |
|
Actually, before I merge this, do you have the test script and results that show this is the fasted way to create a string from ASCII bytes? #399 (comment) doesn't cover the current implementation. |
|
Sure :D |
|
Still need to do the benchmarking with various sizes; though taking a step as the bracketed chunk where 119 being 12 steps (120 dropping back to 10 steps as is multiple of 12)
Working for a 14 stage pipelined cpu so Is different for the non-byte conversion copy as int and longs (and vectors) can be used. |
|
Running benchmarks now |
|
Benchmark using Actually creating the strings to test properly. BenchmarkDotNet-Dev=v0.8.0.0+
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz, ProcessorCount=8
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=AsciiCreate Mode=Throughput Platform=X64 Jit=RyuJit .NET=HostFramework toolchain=Classic Runtime=Clr Warmup=5 Target=10
Though |
|
Thanks for redoing the tests. |

+1% RPS - also less complicated; with a single code path rather than three,
Before
After