
Hexagon: Added support for generating vgather instruction on versions >=v65 #1

Merged
aankit-ca merged 15 commits into master from hexagon_vgather
Oct 3, 2018

Conversation

@aankit-ca (Owner):

No description provided.

@pranavb-ca left a comment:

This is good stuff and now this looks far better than before.

src/Expr.h Outdated
* "local" in OpenCL, and "threadgroup" in metal. Can be shared
* across GPU threads within the same block. */
GPUShared,
Vtcm,

@pranavb-ca:

Minor nit: for the name, please consider VTCM as opposed to Vtcm, since it is an abbreviation, unlike Heap, Stack, Register, etc.

@aankit-ca (Owner, Author):

Changed the name to VTCM

const string str_max_size = target.has_large_buffers() ? "2^63 - 1" : "2^31 - 1";
user_error << "Total size for allocation " << name << " is constant but exceeds " << str_max_size << ".";
} else if (memory_type == MemoryType::Heap ||
memory_type == MemoryType::Vtcm ||

@pranavb-ca:

VTCM is HVX-specific; please consider adding some asserts for the target here.

@aankit-ca (Owner, Author):

Added an assert at function start

if (new_expr.defined()) {
allocation.ptr = codegen(new_expr);
} else {
string malloc_nm = (memory_type == MemoryType::Vtcm) ?

@pranavb-ca:

There is a chance Dillon could suggest the following. Don't make this change yet; I'm just flagging a possibility he could ask for: why not define new_expr as halide_vtcm_malloc and free_function as halide_vtcm_free in the Allocate node when the memory_type is VTCM?

void visit(const Allocate *op) {
    internal_assert(!op->new_expr && free_function.empty() &&
                    "VTCM cannot have custom allocators and deallocators\n");
    new_expr = Call::make(Int(32), "halide_vtcm_malloc", size);
    free_function = "halide_vtcm_free";
}

@aankit-ca (Owner, Author):

Yes Pranav, this is a possibility. But if the user wants to use VTCM memory as a scratch buffer, it is better to have the logic in CodeGen_Posix, since HexagonOptimize deals mostly with Hexagon optimizations and all the allocation logic is already in this create allocation path. But I agree with your concern; Dillon might ask us to move this to the IRMutator.

@aankit-ca (Owner, Author):

I got that wrong last time; I only understood new_expr and free_function much later. I should have read the comment more carefully.

int intrin_lanes = native_vector_bits()/op->type.bits();
// Cut up the indices into appropriately-sized pieces.
vector<Value *> results;
string suffix = (op->type.bits() == 16) ? ".h.h" : ".w.w";

@pranavb-ca:

Do you want to assert for other types?

@aankit-ca (Owner, Author):

Added an assert for 8-bit types.

// 1. out(x) = lut(foo(x)) -> vgather
// 2. out(idx(x)) = foo(x) -> vscatter
// For gathers out and lut should be in VTCM in a single page.
class ScatterGatherGenerator : public IRMutator {

@pranavb-ca:

Please use IRMutator2. IRMutator is not preferred any more.

@aankit-ca (Owner, Author):

It was in the plan. The next commit should make this change too.

Expr is_gather(const Load *op, const Expr dst_base, const Expr dst_index) {
Type ty = op->type;
const Allocate *alloc = allocations[op->name];
if (op->index.as<Ramp>() || !alloc ||

@pranavb-ca:

Please consider breaking this up like so:

if (!alloc || alloc->memory_type != MemoryType::Vtcm) return Expr();
if (op->index.as<Ramp>()) return Expr();
internal_assert(is_one(op->predicate) && ty.is_vector() && ty.bits() != 8);

@aankit-ca (Owner, Author):

I've changed this: (!alloc || alloc->memory_type != MemoryType::Vtcm) is now a separate if block. Is the assert the right thing there? If the predicate isn't all-ones it's still not an error; we just don't have to do anything.

return Expr();
}

Expr index = mutate(ty.bits()/8 * op->index);

@pranavb-ca:

ty.bytes()

@aankit-ca (Owner, Author):

Done

Expr new_index = mutate(cast(ty, index));

return Call::make(ty, "gather", {dst_base, dst_index, src, size-1, new_index},
Call::PureIntrinsic);

@pranavb-ca:

Is PureIntrinsic correct? gather does affect memory and represents a scheduling barrier, right?

@aankit-ca (Owner, Author):

Changing to Intrinsic, although we use setDoesNotAccessMemory() for both Intrinsic and PureIntrinsic.

define weak_odr <64 x i16> @halide.hexagon.vgather.h.h(i16* %dst_base_ptr, i32 %dst_index, i16* %src_16ptr, i32 %size, <64 x i16> %lut) nounwind uwtable {
%lut32 = bitcast <64 x i16> %lut to <32 x i32>
%src = ptrtoint i16* %src_16ptr to i32
%dst_base = ptrtoint i16* %dst_base_ptr to i32

@pranavb-ca:

We should consider using getelementptr for pointer arithmetic instead of ptrtoint, because from my days of working on LLVM I remember that ptrtoint and inttoptr can throw off some optimizations. For now, don't make any changes until I talk to Krzysztof tomorrow to confirm this.

@aankit-ca (Owner, Author):

Hi Pranav, I was also wondering whether using GEP is wiser. Should I change this to GEP?

@pranavb-ca:

Yes, GEP is better.

@aankit-ca (Owner, Author):

Thanks a lot Pranav. Changing this to GEP.

@pranavb-ca left a comment:

Reviewed the sim changes this time.

#include <stdlib.h>

bool vtcm_ready = false;
const unsigned int TCM_BASE = 0xD800 << 16;

@pranavb-ca:

Did you get these constant values from the example that I had shared with you? Do we know it is ok to use these for our case?

@aankit-ca (Owner, Author):

Yes, I borrowed the values from the example. It would be better to know the origin of these numbers. Eric did not reply to the last email; I'll send the mail again to confirm this.

unsigned int aa = 0;
unsigned int vg = 3; // Set valid and ignore asid
unsigned int page_size = 8; // 256KB
add_translation_extended(1, va, pa, page_size, xwru, cccc, asid, aa, vg);

@pranavb-ca:

From line 113 to 116, please add comments explaining what is going on.

@aankit-ca (Owner, Author):

I have not added comments to the entire file yet. You'll find the comments in the next commit.

Node* prev = NULL;
Node* curr = list;
while (curr) {
if (list->addr == addr) {

@pranavb-ca:

Shouldn't this be if (curr->addr == addr)?

@aankit-ca (Owner, Author):

Yes Pranav, this should be curr. Thanks for the catch.

}

// Add and merge new node to list in sorted order.
void addAndMerge(Node* &list, Node* a) {

@pranavb-ca:

Minor nit: it looks like addAndMerge is only needed for free_blocks. If so, why take a list argument? This is a minor point, though.

@aankit-ca (Owner, Author):

I just used this to give flexibility to call addAndMerge on the used list as well, if needed in the future. I'll remove the argument and directly use the global variable.

@aankit-ca (Owner, Author) left a comment:

@pranavb-ca Hi Pranav, I've made changes as you requested. Please let me know if more changes are required.


@aankit-ca aankit-ca self-assigned this Aug 20, 2018
@aankit-ca aankit-ca merged commit 7c8355c into master Oct 3, 2018
aankit-ca pushed a commit that referenced this pull request Oct 25, 2022
* Let lerp lowering incorporate a final cast

This lets it save a few instructions on x86 and arm.

cast(UInt(16), lerp(some_u8s)) produces the following, before and after
this PR

Before:

x86:

	vmovdqu	(%r15,%r13), %xmm4
	vpmovzxbw	-2(%r15,%r13), %ymm5
	vpxor	%xmm0, %xmm4, %xmm6
	vpmovzxbw	%xmm6, %ymm6
	vpmovzxbw	-1(%r15,%r13), %ymm7
	vpmullw	%ymm6, %ymm5, %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm7, %ymm4
	vpaddw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm1, %ymm4, %ymm4
	vpmulhuw	%ymm2, %ymm4, %ymm4
	vpsrlw	$7, %ymm4, %ymm4
	vpand	%ymm3, %ymm4, %ymm4
	vmovdqu	%ymm4, (%rbx,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10
arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	urshr	v1.8h, v4.8h, #8
	urshr	v2.8h, v0.8h, #8
	raddhn	v1.8b, v1.8h, v4.8h
	raddhn	v0.8b, v2.8h, v0.8h
	ushll	v0.8h, v0.8b, #0
	ushll	v1.8h, v1.8b, #0
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

After:

x86:

	vpmovzxbw	-2(%r15,%r13), %ymm3
	vmovdqu	(%r15,%r13), %xmm4
	vpxor	%xmm0, %xmm4, %xmm5
	vpmovzxbw	%xmm5, %ymm5
	vpmullw	%ymm5, %ymm3, %ymm3
	vpmovzxbw	-1(%r15,%r13), %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm4, %ymm3, %ymm3
	vpaddw	%ymm1, %ymm3, %ymm3
	vpmulhuw	%ymm2, %ymm3, %ymm3
	vpsrlw	$7, %ymm3, %ymm3
	vmovdqu	%ymm3, (%rbp,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10

arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	ursra	v4.8h, v4.8h, #8
	ursra	v0.8h, v0.8h, #8
	urshr	v1.8h, v4.8h, #8
	urshr	v0.8h, v0.8h, #8
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a rounding add and shift-right instead of a rounding narrowing add-shift-right followed by a widen.

* Add test

* Fix bug in test

* Don't produce out-of-range lerp values
aankit-ca pushed a commit that referenced this pull request Oct 25, 2022
* add_requirement() maintenance

This PR started out as a quick fix to add Python bindings for the `add_requirements` methods on Pipeline and Generator (which were missing), but expanded a bit to fix other issues as well:
- The implementation of `Generator::add_requirement` was subtly wrong, in that it only worked if you called the method after everything else in your `generate()` method. Now we accumulate requirements and insert them at the end, so you can call the method anywhere.
- We had C++ methods that took both an explicit `vector<Expr>` and also a variadic-template version, but the former required a mutable vector... and fixing this to not require that ended up creating ambiguity about which overloaded call to use. Added an ugly enable_if thing to resolve this.

(Side note #1: overloading methods to have both templated and non-templated versions with the same name is probably something to avoid in the future.)

(Side note #2: we should probably think more carefully about using variadic templates in our public API in the future; we currently use them pretty heavily, but they tend to be messy and hard to reason about, IMHO.)

* tidy

* remove underscores