Skip to content
This repository was archived by the owner on Oct 12, 2022. It is now read-only.
/ druntime Public archive

templated array ops#1891

Merged
dlang-bot merged 12 commits intodlang:masterfrom
MartinNowak:arrayOps
Aug 9, 2017
Merged

templated array ops#1891
dlang-bot merged 12 commits intodlang:masterfrom
MartinNowak:arrayOps

Conversation

@MartinNowak
Copy link
Member

@MartinNowak MartinNowak commented Jul 26, 2017

  • runtime implementation of templated array operations w/ vectorization
  • expresion tree is passed in RPN to encode operator precendence
  • converts the expression to a simple loop
  • for dmd the loop is vectorized
  • gdc/ldc rely on auto-vectorization
  • huge performance and latency improvements for some operations
  • small throughput decreases due to loop overhead with dmd's optimizer
    and the floatArray[] / scalar change mentioned in the changelog
  • supports vectorization of more complex operations

latency

array_ops_latency

throughput

array_ops_throughput

@dlang-bot
Copy link
Contributor

dlang-bot commented Jul 26, 2017

Thanks for your pull request, @MartinNowak!

Bugzilla references

Auto-close Bugzilla Description
7509 Allow SIMD variable contents to have all their values changed to a single float variable
15619 [REG 2.066] Floating-point x86_64 codegen regression, when involving array ops
16680 dmd doesn't use druntime optimized versions of subtraction array operations

⚠️⚠️⚠️ Warnings ⚠️⚠️⚠️

@MartinNowak
Copy link
Member Author

MartinNowak commented Jul 26, 2017

Jeez, got broken by dlang/dmd#7019.

else version (X86_64)
version = X86_OR_X64;
else
static assert(0, "unimplemented");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't break ARM, please.

{
storeUnaligned!vec(val, p);
}
else version (GNU)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have version (GNU) inside DigitalMars

else static if (is(T == double))
return __builtin_ia32_loadupd(p);
else
return __builtin_ia32_loaddqu(cast(const char*) p);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just a leftover, I removed the GNU/LDC stuff later on, when I first tested this a few month ago auto-vectorization didn't work too well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there's a very very limited set of conditions that allow it to generate small and optimal code. In the best case for parameters, you might at least have a branch generated that is ran if all conditions are met.

Otherwise I guess we could generate a generic simd load using:

auto v1 = *cast(float4*) a1.ptr;

Loop as necessary, then do single element operations after that.

Copy link
Contributor

@dnadlinger dnadlinger Jul 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*cast(float4*) a1.ptr

That would assume alignment to float4.alignof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would assume alignment to float4.alignof.

Right. I should have remembered the segfault in dmd I recently encountered. Because that is precisely what would happen. ;-)

}
}
for (; pos < res.length; ++pos)
mixin(scalarExp!Args ~ ";");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this generate something comparable to what the compiler currently generates for fd->isArrayOp functions?

Copy link
Member

@ibuclaw ibuclaw Jul 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked, and the loop is identical. https://explore.dgnu.org/g/sNV7Bo 👍

(On an unrelated note, I should make gdc more smarter with its template emission strategy)

Copy link
Member Author

@MartinNowak MartinNowak Jul 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we should figure out which template instances are needed at runtime, affects all compilers and could likely be done in the frontend.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two really important pieces of information that could be most beneficial are:

  1. Was this instantiated in only by CTFE? We never need these.
  2. Was this instantiated inside a function or module/class scope? With instantiations from a function it should be fine to discard inlined and unreferenced functions in this current compilation.

@MartinNowak
Copy link
Member Author

A bit frustrating how much effort was only spent because of dmd's backend.

enum bool hasElaborateCopyConstructor = false;
}

template Filter(alias pred, TList...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment saying what Filter does.

assert(__cmp([c2, c2], [c1, c1]) > 0);
}

template _arrayOp(Args...)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this is mysterious. Is it to just forward to core.internal.arrayop.arrayOp ? If so, why not just use alias? Please add documentation comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a template so we don't unnecessarily import internal modules.

@WalterBright
Copy link
Member

@ibuclaw is your review satisfied?

@WalterBright
Copy link
Member

A bit frustrating how much effort was only spent because of dmd's backend.

Yeah, but your improvements made it worth the effort!

@ibuclaw
Copy link
Member

ibuclaw commented Jul 30, 2017

@ibuclaw is your review satisfied?

Yes, I tested a copy of and it works on all gdc targets. Just the comment on the extra template bloat of entirely unreferenced functions. But that is a compiler concern and not a blocker for this.

@WalterBright WalterBright dismissed ibuclaw’s stale review July 30, 2017 10:10

Because @ibuclaw said his review was satisfied.

@MartinNowak
Copy link
Member Author

Done @WalterBright

Yeah, but your improvements made it worth the effort!

Hardly, neither GDC nor LDC need any of this. The vector code in this PR (which required the various dmd backend fixes for SIMD) basically just does the same as auto-vectorization in GDC/LDC.

At least we can stop maintaining a huge amout of hand-written assembly code.

@MartinNowak MartinNowak force-pushed the arrayOps branch 2 times, most recently from c729481 to 9d04170 Compare August 1, 2017 11:46
@MartinNowak
Copy link
Member Author

Anything left @WalterBright, @ibuclaw?

@ibuclaw
Copy link
Member

ibuclaw commented Aug 3, 2017

I have no problems with this.

@ibuclaw
Copy link
Member

ibuclaw commented Aug 3, 2017

Do we have an open issue about template instantiation in general?

@MartinNowak
Copy link
Member Author

Do we have an open issue about template instantiation in general?

It's not even clear whether it's an actual problem (and how big it is).
Issue 17719 – compiler generates code for CTFE-only templates
Intuitively I'd say the codegen for CTFE instantiations is less of a problem compared to the semantic speed of template instantiations in general. Lots of instantiations will be used at runtime, and distinguishing which are not used at runtime is not exactly trivial.

@dnadlinger
Copy link
Contributor

--gc-sections is also a very effective band-aid for the issue (although compile speed of course still suffers).

@ibuclaw
Copy link
Member

ibuclaw commented Aug 4, 2017

Lots of instantiations will be used at runtime, and distinguishing which are not used at runtime is not exactly trivial.

indeed. But I think we (possibly meaning I and @klickverbot, not dmd) could probably get away with just adding one new field that represents the least restrictive scope that a template was instantiated in. If an instantiation is only ever used inside a function, then I could allow my backend to discard inlined and/or unreferenced templates. However if it were instantiated inside a top-level type or module scope, then everything will need to be emitted.

Maybe I'm only thinking of contrived / simple examples though.

@ibuclaw
Copy link
Member

ibuclaw commented Aug 4, 2017

@WalterBright - please review. :-)

Copy link
Contributor

@nemanja-boric-sociomantic nemanja-boric-sociomantic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment

enum vectorizeable = vectorizeableOps!E([Filter!(not!isType, Args)])
&& compatibleVecTypes!(E, Filter!(isType, Args));
else
enum vectorizeable = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bug in dlang-community/dfmt#286, fixed manually for now.

- use RPN to encode operand precedence
- fixes Issue 15619, and 16680
- properly sort/order values on abscissa
- support for targets specific vector ops (e.g. AVX vs. SSE2)
- dmd got broadcast init with #6248
- seems to have made quite some improvements while that module was written
- generated code for scalar loops and for vector loops ends up being almost identical,
  so it seems more reasonable to leave decisions completely to the auto-vectorizers.
- e.g. replacement of ary[] / scalar with weaker ary[] >> 1
return op.length == 2 && op[1] == '=' && isBinaryOp(op[0 .. 1]);
}

string scalarExp(Args...)()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Desperately needs documentation - for example, what is the format of the RPN string? What are the Args? Where does the RPN string come from?


// Generate mixin expression to perform scalar arrayOp loop expression, assumes
// `pos` to be the current slice index, `args` to contain operand values, and
// `res` the target slice.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see pos, args, or res in the parameter list or even in the function body. Also, when documenting parameters, please use Ddoc conventions, i.e. a Params: block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a small helper function that generates a mixin string for the only public method in this module _arrayOps.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Enhancement New functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants