templated array ops by MartinNowak · Pull Request #1891 · dlang/druntime

MartinNowak · 2017-07-26T11:04:30Z

- runtime implementation of templated array operations w/ vectorization
- expression tree is passed in RPN to encode operator precedence
- converts the expression to a simple loop
  - for dmd the loop is vectorized
  - gdc/ldc rely on auto-vectorization
- huge performance and latency improvements for some operations
- small throughput decreases due to loop overhead with dmd's optimizer and the floatArray[] / scalar change mentioned in the changelog
- supports vectorization of more complex operations
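
For readers unfamiliar with D array operations, here is a minimal sketch of what "converts the expression to a simple loop" means at the semantic level. The function names `viaArrayOp` and `viaLoop` are made up for illustration; this is not the code generated by this PR, only an array operation next to the element-wise loop it is equivalent to.

```d
// Array operation: the whole expression is evaluated element by element.
float[] viaArrayOp(const(float)[] a, const(float)[] b)
{
    auto res = new float[](a.length);
    res[] = a[] + b[] * 2.0f;
    return res;
}

// The same computation written as an explicit scalar loop.
float[] viaLoop(const(float)[] a, const(float)[] b)
{
    assert(a.length == b.length);
    auto res = new float[](a.length);
    foreach (i; 0 .. res.length)
        res[i] = a[i] + b[i] * 2.0f;
    return res;
}

unittest
{
    float[] a = [1, 2, 3, 4];
    float[] b = [10, 20, 30, 40];
    // Both forms produce the same result.
    assert(viaArrayOp(a, b) == viaLoop(a, b));
    assert(viaArrayOp(a, b) == [21f, 42, 63, 84]);
}
```

Compiling with `-unittest -main` runs the check.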

[Benchmark plots: latency, throughput]

dlang-bot · 2017-07-26T11:04:31Z

Thanks for your pull request, @MartinNowak!

Bugzilla references

Auto-close	Bugzilla	Description
✗	7509	Allow SIMD variable contents to have all their values changed to a single float variable
✓	15619	[REG 2.066] Floating-point x86_64 codegen regression, when involving array ops
✓	16680	dmd doesn't use druntime optimized versions of subtraction array operations

⚠️⚠️⚠️ Warnings ⚠️⚠️⚠️

Regression or critical bug fixes should always target the stable branch. Learn more about rebasing to stable or the D release process.

MartinNowak · 2017-07-26T14:23:53Z

Jeez, got broken by dlang/dmd#7019.

ibuclaw · 2017-07-26T14:39:59Z

src/core/internal/arrayop.d

+else version (X86_64)
+    version = X86_OR_X64;
+else
+    static assert(0, "unimplemented");


Don't break ARM, please.

ibuclaw · 2017-07-26T14:57:15Z

src/core/internal/arrayop.d

+        {
+            storeUnaligned!vec(val, p);
+        }
+        else version (GNU)


You have version (GNU) inside DigitalMars

ibuclaw · 2017-07-26T15:00:36Z

src/core/internal/arrayop.d

+            else static if (is(T == double))
+                return __builtin_ia32_loadupd(p);
+            else
+                return __builtin_ia32_loaddqu(cast(const char*) p);


Oh, just a leftover, I removed the GNU/LDC stuff later on. When I first tested this a few months ago, auto-vectorization didn't work too well.

Yeah, there's a very limited set of conditions that allow it to generate small and optimal code. In the best case for parameters, you might at least have a branch generated that is run if all conditions are met.

Otherwise I guess we could generate a generic simd load using:

auto v1 = *cast(float4*) a1.ptr;

Loop as necessary, then do single element operations after that.

*cast(float4*) a1.ptr

That would assume alignment to float4.alignof.

That would assume alignment to float4.alignof.

Right. I should have remembered the segfault in dmd I recently encountered. Because that is precisely what would happen. ;-)
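
The "loop as necessary, then do single element operations" idea and the alignment caveat can be sketched in plain D. This is only an illustration: `isAligned` and `addInto` are hypothetical names, and the chunk of four elements stands in for a `float4` without using any SIMD loads, so nothing here depends on alignment.

```d
// Hypothetical helper: true if `p` is a multiple of `alignment` bytes.
bool isAligned(const(void)* p, size_t alignment) pure nothrow @nogc
{
    return (cast(size_t) p) % alignment == 0;
}

// Chunked main loop followed by a scalar tail for the leftover elements.
void addInto(float[] res, const(float)[] a, const(float)[] b)
{
    assert(res.length == a.length && a.length == b.length);
    enum chunk = 4; // stand-in for the lane count of float4
    size_t pos = 0;
    for (; pos + chunk <= res.length; pos += chunk)
        foreach (lane; 0 .. chunk)
            res[pos + lane] = a[pos + lane] + b[pos + lane];
    for (; pos < res.length; ++pos)
        res[pos] = a[pos] + b[pos];
}

unittest
{
    auto a = [1f, 2, 3, 4, 5, 6, 7];
    auto b = [10f, 20, 30, 40, 50, 60, 70];
    auto res = new float[](a.length);
    addInto(res, a, b);
    assert(res == [11f, 22, 33, 44, 55, 66, 77]);

    // Neighbouring floats are 4 bytes apart, so they cannot both be
    // 16-byte aligned; a slice may start at either of them.
    assert(!(isAligned(a.ptr, 16) && isAligned(a.ptr + 1, 16)));
}
```

Since a slice can begin at any element, a load through `*cast(float4*)` is only safe after a check like `isAligned`, or when replaced by an unaligned load.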

ibuclaw · 2017-07-26T15:04:23Z

src/core/internal/arrayop.d

+        }
+    }
+    for (; pos < res.length; ++pos)
+        mixin(scalarExp!Args ~ ";");


Does this generate something comparable to what the compiler currently generates for fd->isArrayOp functions?

I checked, and the loop is identical. https://explore.dgnu.org/g/sNV7Bo 👍

(On an unrelated note, I should make gdc smarter with its template emission strategy)

Yes, we should figure out which template instances are needed at runtime, affects all compilers and could likely be done in the frontend.

Two really important pieces of information that could be most beneficial are:

Was this instantiated only by CTFE? We never need these.

Was this instantiated inside a function or module/class scope? With instantiations from a function it should be fine to discard inlined and unreferenced functions in this current compilation.

MartinNowak · 2017-07-27T07:53:15Z

A bit frustrating how much effort was only spent because of dmd's backend.

WalterBright · 2017-07-29T22:54:22Z

src/core/internal/traits.d

        enum bool hasElaborateCopyConstructor = false;
 }
+
+template Filter(alias pred, TList...)


Please add comment saying what Filter does.

WalterBright · 2017-07-29T22:55:41Z

src/object.d

    assert(__cmp([c2, c2], [c1, c1]) > 0);
 }

+template _arrayOp(Args...)


The purpose of this is mysterious. Is it to just forward to core.internal.arrayop.arrayOp ? If so, why not just use alias? Please add documentation comment.

It's a template so we don't unnecessarily import internal modules.
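
A minimal sketch of that pattern, assuming a stand-in target: the import sits inside the template body, so it is only processed when the template is instantiated. `forwardMax` is a hypothetical name, and `std.algorithm.comparison.max` merely takes the place of the internal module's function here.

```d
// Forwarding template: the import below is only processed on instantiation,
// so merely importing the enclosing module does not pull in the dependency.
template forwardMax(Args...)
{
    import std.algorithm.comparison : max; // stand-in for an internal module
    alias forwardMax = max!Args;
}

unittest
{
    // Instantiating the template triggers the import and forwards the call.
    assert(forwardMax!(int, int)(3, 7) == 7);
}
```

A plain module-level `alias` would need the import at module scope, which is what the template avoids.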

WalterBright · 2017-07-29T22:57:32Z

@ibuclaw is your review satisfied?

WalterBright · 2017-07-29T22:59:00Z

A bit frustrating how much effort was only spent because of dmd's backend.

Yeah, but your improvements made it worth the effort!

ibuclaw · 2017-07-30T08:10:30Z

@ibuclaw is your review satisfied?

Yes, I tested a copy of this and it works on all gdc targets. Just the comment on the extra template bloat of entirely unreferenced functions. But that is a compiler concern and not a blocker for this.

@ibuclaw

Because @ibuclaw said his review was satisfied.

MartinNowak · 2017-07-31T11:41:32Z

Done @WalterBright

Yeah, but your improvements made it worth the effort!

Hardly, neither GDC nor LDC need any of this. The vector code in this PR (which required the various dmd backend fixes for SIMD) basically just does the same as auto-vectorization in GDC/LDC.

At least we can stop maintaining a huge amount of hand-written assembly code.

MartinNowak · 2017-08-03T15:29:41Z

Anything left @WalterBright, @ibuclaw?

ibuclaw · 2017-08-03T16:33:21Z

I have no problems with this.

ibuclaw · 2017-08-03T16:41:51Z

Do we have an open issue about template instantiation in general?

MartinNowak · 2017-08-03T21:03:01Z

Do we have an open issue about template instantiation in general?

It's not even clear whether it's an actual problem (and how big it is).
Issue 17719 – compiler generates code for CTFE-only templates
Intuitively I'd say the codegen for CTFE instantiations is less of a problem compared to the semantic speed of template instantiations in general. Lots of instantiations will be used at runtime, and distinguishing which are not used at runtime is not exactly trivial.

dnadlinger · 2017-08-03T23:03:39Z

--gc-sections is also a very effective band-aid for the issue (although compile speed of course still suffers).

ibuclaw · 2017-08-04T07:37:20Z

Lots of instantiations will be used at runtime, and distinguishing which are not used at runtime is not exactly trivial.

Indeed. But I think we (possibly meaning I and @klickverbot, not dmd) could probably get away with just adding one new field that represents the least restrictive scope that a template was instantiated in. If an instantiation is only ever used inside a function, then I could allow my backend to discard inlined and/or unreferenced templates. However, if it were instantiated inside a top-level type or module scope, then everything will need to be emitted.

Maybe I'm only thinking of contrived / simple examples though.

ibuclaw · 2017-08-04T07:39:23Z

@WalterBright - please review. :-)

nemanja-boric-sociomantic

small comment

nemanja-boric-sociomantic · 2017-08-04T08:50:00Z

src/core/internal/arrayop.d

+        enum vectorizeable = vectorizeableOps!E([Filter!(not!isType, Args)])
+                && compatibleVecTypes!(E, Filter!(isType, Args));
+    else
+                    enum vectorizeable = false;


indentation

It's a bug in dlang-community/dfmt#286, fixed manually for now.

WalterBright · 2017-08-07T20:09:48Z

src/core/internal/arrayop.d

+    return op.length == 2 && op[1] == '=' && isBinaryOp(op[0 .. 1]);
+}
+
+string scalarExp(Args...)()


Desperately needs documentation - for example, what is the format of the RPN string? What are the Args? Where does the RPN string come from?
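
To make the RPN question concrete, here is a rough sketch of how a sequence in Reverse Polish Notation can be turned into an infix expression string for a mixin. It is not the `scalarExp` from this PR: `rpnToInfix`, the plain string tokens, and the operand names `a`, `b`, `c` are all assumptions made for illustration.

```d
// Hypothetical helper: turn RPN tokens into a parenthesized infix expression.
// Operands are pushed on a stack; a binary operator pops two and pushes one.
string rpnToInfix(string[] tokens)
{
    string[] stack;
    foreach (tok; tokens)
    {
        if (tok == "+" || tok == "-" || tok == "*" || tok == "/")
        {
            assert(stack.length >= 2, "binary operator needs two operands");
            auto rhs = stack[$ - 1];
            auto lhs = stack[$ - 2];
            stack = stack[0 .. $ - 2] ~ ("(" ~ lhs ~ " " ~ tok ~ " " ~ rhs ~ ")");
        }
        else
            stack ~= tok; // operand name
    }
    assert(stack.length == 1, "malformed RPN");
    return stack[0];
}

unittest
{
    // a + b * c is written as: a b c * +
    enum expr = rpnToInfix(["a", "b", "c", "*", "+"]);
    static assert(expr == "(a + (b * c))");
    // (a + b) * c is written as: a b + c *
    static assert(rpnToInfix(["a", "b", "+", "c", "*"]) == "((a + b) * c)");

    // The generated string can be mixed in as an ordinary expression.
    int a = 1, b = 2, c = 3;
    assert(mixin(expr) == 7);
}
```

Because the order of the tokens already fixes the evaluation order, no precedence table is needed when rebuilding the expression.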

WalterBright · 2017-08-09T09:33:37Z

src/core/internal/arrayop.d


+// Generate mixin expression to perform scalar arrayOp loop expression, assumes
+// `pos` to be the current slice index, `args` to contain operand values, and
+// `res` the target slice.


I don't see pos, args, or res in the parameter list or even in the function body. Also, when documenting parameters, please use Ddoc conventions, i.e. a Params: block.

It's just a small helper function that generates a mixin string for the only public method in this module _arrayOps.

MartinNowak force-pushed the arrayOps branch from d402529 to aa93d8c Compare July 26, 2017 11:10

dlang-bot added the Enhancement New functionality label Jul 26, 2017

MartinNowak mentioned this pull request Jul 26, 2017

convert array ops to library calls dlang/dmd#7032

Merged

ibuclaw previously requested changes Jul 26, 2017

ibuclaw reviewed Jul 26, 2017

MartinNowak force-pushed the arrayOps branch from aa93d8c to adca015 Compare July 26, 2017 20:22

WalterBright suggested changes Jul 29, 2017

MartinNowak force-pushed the arrayOps branch from adca015 to 42c0103 Compare July 31, 2017 11:40

MartinNowak force-pushed the arrayOps branch 2 times, most recently from c729481 to 9d04170 Compare August 1, 2017 11:46

ibuclaw approved these changes Aug 4, 2017

nemanja-boric-sociomantic reviewed Aug 4, 2017

MartinNowak added 3 commits August 7, 2017 15:49

implement templated array ops

184435f

- use RPN to encode operand precedence - fixes Issue 15619, and 16680

fix plotting of arrayops benchmark

84d49d1

- properly sort/order values on abscissa

change plot to relative numbers

f8dd223

MartinNowak added 8 commits August 7, 2017 15:49

switch to easier to read bar plot

973f3a2

vectorizable ops by introspection

549bc8b

- support for targets specific vector ops (e.g. AVX vs. SSE2)

proper error message for unsupported scalar ops

60d0eef

- with UDTs

remove Issue 7509/16488 workaround

4eaf500

- dmd got broadcast init with #6248

always use vec ops

9c5d83b

rely on auto-vectorizer for gdc/ldc

9d7faf9

- seems to have made quite some improvements while that module was written - generated code for scalar loops and for vector loops ends up being almost identical, so it seems more reasonable to leave decisions completely to the auto-vectorizers.

use __gshared scalar to avoid const-folding

69ff724

- e.g. replacement of ary[] / scalar with weaker ary[] >> 1

add changelog for templated array ops

6bdc5a4

MartinNowak force-pushed the arrayOps branch from 9d04170 to 6bdc5a4 Compare August 7, 2017 13:55

WalterBright reviewed Aug 7, 2017

more docs

aee45fb

WalterBright reviewed Aug 9, 2017

WalterBright approved these changes Aug 9, 2017

WalterBright added the auto-merge label Aug 9, 2017

dlang-bot merged commit bc16735 into dlang:master Aug 9, 2017

MartinNowak deleted the arrayOps branch August 9, 2017 10:56

MartinNowak mentioned this pull request Aug 10, 2017

add missing ^^ and ^^= arrayop support #1899

Merged

timotheecour mentioned this pull request Feb 19, 2018

squeeze out more performance in case of no pointer aliasing libmir/mir-blas#2

Closed

RazvanN7 mentioned this pull request Nov 10, 2020

Fix Issue 21110 - OOB memory access, safety violation #3267

Merged
