
Hexagon: Added support for generating vgather instruction on versions >=v65 #1

Merged
aankit-ca merged 15 commits into master from hexagon_vgather
Oct 3, 2018

Conversation

@aankit-ca (Owner):

No description provided.

@pranavb-ca left a comment:

This is good stuff and now this looks far better than before.

src/Expr.h Outdated
* "local" in OpenCL, and "threadgroup" in metal. Can be shared
* across GPU threads within the same block. */
GPUShared,
Vtcm,

@pranavb-ca:

Minor nit: for the name, please consider VTCM as opposed to Vtcm, since it is an abbreviation, unlike Heap, Stack, Register, etc.

@aankit-ca (Owner, Author):

Changed the name to VTCM

const string str_max_size = target.has_large_buffers() ? "2^63 - 1" : "2^31 - 1";
user_error << "Total size for allocation " << name << " is constant but exceeds " << str_max_size << ".";
} else if (memory_type == MemoryType::Heap ||
memory_type == MemoryType::Vtcm ||

@pranavb-ca:

VTCM is HVX-specific; please consider adding some asserts for the target here.

@aankit-ca (Owner, Author):

Added an assert at function start

if (new_expr.defined()) {
allocation.ptr = codegen(new_expr);
} else {
string malloc_nm = (memory_type == MemoryType::Vtcm) ?

@pranavb-ca:

There is a chance Dillon could suggest the following. Don't make this change yet; I'm just flagging a possibility he could ask for: why not define new_expr as halide_vtcm_malloc and free_function as halide_vtcm_free in the Allocate node when the memory_type is VTCM?

void visit(const Allocate *op) {
    internal_assert(!op->new_expr && free_function.empty() &&
                    "VTCM cannot have custom allocators and deallocators\n");
    new_expr = Call::make(Int(32), "halide_vtcm_malloc", size);
    free_function = "halide_vtcm_free";
}

@aankit-ca (Owner, Author):

Yes Pranav, this is a possibility. But if the user wants to use VTCM memory as a scratch buffer, it is better to have the logic in CodeGen_Posix, since HexagonOptimize deals mostly with Hexagon optimizations and all the allocation logic is already in this create allocation path. But I agree with your concern; Dillon might ask us to move this to the IRMutator.

@aankit-ca (Owner, Author):

I got that wrong last time; I only understood new_expr and free_function much later. I should have read the comment more carefully.

int intrin_lanes = native_vector_bits()/op->type.bits();
// Cut up the indices into appropriately-sized pieces.
vector<Value *> results;
string suffix = (op->type.bits() == 16) ? ".h.h" : ".w.w";

@pranavb-ca:

Do you want to assert for other types?

@aankit-ca (Owner, Author):

Added an assert for 8-bit types.

// 1. out(x) = lut(foo(x)) -> vgather
// 2. out(idx(x)) = foo(x) -> vscatter
// For gathers out and lut should be in VTCM in a single page.
class ScatterGatherGenerator : public IRMutator {

@pranavb-ca:

Please use IRMutator2. IRMutator is not preferred any more.

@aankit-ca (Owner, Author):

It was in the plan. The next commit should make this change too.

Expr is_gather(const Load *op, const Expr dst_base, const Expr dst_index) {
Type ty = op->type;
const Allocate *alloc = allocations[op->name];
if (op->index.as<Ramp>() || !alloc ||

@pranavb-ca:

Please consider breaking this up like so:

if (!alloc || alloc->memory_type != MemoryType::Vtcm) return Expr();
if (op->index.as<Ramp>()) return Expr();
internal_assert(is_one(op->predicate) && ty.is_vector() && ty.bits() != 8);

@aankit-ca (Owner, Author):

I've changed this: (!alloc || alloc->memory_type != MemoryType::Vtcm) is now a separate if block. Is the assert the right thing there? If the predicate isn't all-ones it's still not an error; we just don't have to do anything.

return Expr();
}

Expr index = mutate(ty.bits()/8 * op->index);

@pranavb-ca:

ty.bytes()

@aankit-ca (Owner, Author):

Done

Expr new_index = mutate(cast(ty, index));

return Call::make(ty, "gather", {dst_base, dst_index, src, size-1, new_index},
Call::PureIntrinsic);

@pranavb-ca:

Is PureIntrinsic correct? gather does affect memory and represents a scheduling barrier, right?

@aankit-ca (Owner, Author):

Changing to Intrinsic, although we use setDoesNotAccessMemory() for both Intrinsic and PureIntrinsic.

define weak_odr <64 x i16> @halide.hexagon.vgather.h.h(i16* %dst_base_ptr, i32 %dst_index, i16* %src_16ptr, i32 %size, <64 x i16> %lut) nounwind uwtable {
%lut32 = bitcast <64 x i16> %lut to <32 x i32>
%src = ptrtoint i16* %src_16ptr to i32
%dst_base = ptrtoint i16* %dst_base_ptr to i32

@pranavb-ca:

We should consider using getelementptr for pointer arithmetic instead of ptrtoint, because from my days of working on LLVM I remember that ptrtoint and inttoptr can throw off some optimizations. For now, don't make any changes until I talk to Krzysztof tomorrow to confirm this.

@aankit-ca (Owner, Author):

Hi Pranav, I was also wondering whether using GEP is wiser. Should I change this to GEP?

@pranavb-ca:

Yes, GEP is better.

@aankit-ca (Owner, Author):

Thanks a lot Pranav. Changing this to GEP.

@pranavb-ca left a comment:

Reviewed the sim changes this time.

#include <stdlib.h>

bool vtcm_ready = false;
const unsigned int TCM_BASE = 0xD800 << 16;

@pranavb-ca:

Did you get these constant values from the example that I had shared with you? Do we know it is ok to use these for our case?

@aankit-ca (Owner, Author):

Yes, I borrowed the values from the example. It would be better to know the origin of these numbers. Eric did not reply to the last email; I'll send the mail again to confirm this.

unsigned int aa = 0;
unsigned int vg = 3; // Set valid and ignore asid
unsigned int page_size = 8; // 256KB
add_translation_extended(1, va, pa, page_size, xwru, cccc, asid, aa, vg);

@pranavb-ca:

From line 113 to 116, please add comments explaining what is going on.

@aankit-ca (Owner, Author):

I have not added comments to the entire file yet. You'll find the comments in the next commit.

Node* prev = NULL;
Node* curr = list;
while (curr) {
if (list->addr == addr) {

@pranavb-ca:

Shouldn't this be if (curr->addr == addr)?

@aankit-ca (Owner, Author):

Yes Pranav, this should be curr. Thanks for the catch.

}

// Add and merge new node to list in sorted order.
void addAndMerge(Node* &list, Node* a) {

@pranavb-ca:

Minor nit: it looks like addAndMerge is only needed for free_blocks. If so, why take a list argument? This is a minor point, though.

@aankit-ca (Owner, Author):

I just used this to give flexibility to call addAndMerge on the used list as well, if needed in the future. I'll remove the argument and directly use the global variable.

@aankit-ca (Owner, Author) left a comment:

@pranavb-ca Hi Pranav, I've made changes as you requested. Please let me know if more changes are required.


@aankit-ca aankit-ca self-assigned this Aug 20, 2018
@aankit-ca aankit-ca merged commit 7c8355c into master Oct 3, 2018
aankit-ca pushed a commit that referenced this pull request Oct 25, 2022
* Let lerp lowering incorporate a final cast

This lets it save a few instructions on x86 and arm.

cast(UInt(16), lerp(some_u8s)) produces the following, before and after
this PR

Before:

x86:

	vmovdqu	(%r15,%r13), %xmm4
	vpmovzxbw	-2(%r15,%r13), %ymm5
	vpxor	%xmm0, %xmm4, %xmm6
	vpmovzxbw	%xmm6, %ymm6
	vpmovzxbw	-1(%r15,%r13), %ymm7
	vpmullw	%ymm6, %ymm5, %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm7, %ymm4
	vpaddw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm1, %ymm4, %ymm4
	vpmulhuw	%ymm2, %ymm4, %ymm4
	vpsrlw	$7, %ymm4, %ymm4
	vpand	%ymm3, %ymm4, %ymm4
	vmovdqu	%ymm4, (%rbx,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10
arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	urshr	v1.8h, v4.8h, #8
	urshr	v2.8h, v0.8h, #8
	raddhn	v1.8b, v1.8h, v4.8h
	raddhn	v0.8b, v2.8h, v0.8h
	ushll	v0.8h, v0.8b, #0
	ushll	v1.8h, v1.8b, #0
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

After:

x86:

	vpmovzxbw	-2(%r15,%r13), %ymm3
	vmovdqu	(%r15,%r13), %xmm4
	vpxor	%xmm0, %xmm4, %xmm5
	vpmovzxbw	%xmm5, %ymm5
	vpmullw	%ymm5, %ymm3, %ymm3
	vpmovzxbw	-1(%r15,%r13), %ymm5
	vpmovzxbw	%xmm4, %ymm4
	vpmullw	%ymm4, %ymm5, %ymm4
	vpaddw	%ymm4, %ymm3, %ymm3
	vpaddw	%ymm1, %ymm3, %ymm3
	vpmulhuw	%ymm2, %ymm3, %ymm3
	vpsrlw	$7, %ymm3, %ymm3
	vmovdqu	%ymm3, (%rbp,%r13,2)
	addq	$16, %r13
	decq	%r10
	jne	.LBB0_10

arm:

	ldr	q0, [x17]
	ldur	q2, [x17, #-1]
	ldur	q1, [x17, #-2]
	subs	x0, x0, #1                      // =1
	mvn	v3.16b, v0.16b
	umull	v4.8h, v2.8b, v0.8b
	umull2	v0.8h, v2.16b, v0.16b
	umlal	v4.8h, v1.8b, v3.8b
	umlal2	v0.8h, v1.16b, v3.16b
	ursra	v4.8h, v4.8h, #8
	ursra	v0.8h, v0.8h, #8
	urshr	v1.8h, v4.8h, #8
	urshr	v0.8h, v0.8h, #8
	add	x17, x17, #16                   // =16
	stp	q1, q0, [x18, #-16]
	add	x18, x18, #32                   // =32
	b.ne	.LBB0_10

So on x86 we skip a pointless `and` instruction, and on ARM we get a rounding add and shift-right instead of a rounding narrowing add-shift-right followed by a widen.

* Add test

* Fix bug in test

* Don't produce out-of-range lerp values
aankit-ca pushed a commit that referenced this pull request Oct 25, 2022
* add_requirement() maintenance

This PR started out as a quick fix to add Python bindings for the `add_requirements` methods on Pipeline and Generator (which were missing), but expanded a bit to fix other issues as well:
- The implementation of `Generator::add_requirement` was subtly wrong, in that it only worked if you called the method after everything else in your `generate()` method. Now we accumulate requirements and insert them at the end, so you can call the method anywhere.
- We had C++ methods that took both an explicit `vector<Expr>` and also a variadic-template version, but the former required a mutable vector... and fixing this to not require that ended up creating ambiguity about which overloaded call to use. Added an ugly enable_if thing to resolve this.

(Side note #1: overloading methods to have both templated and non-templated versions with the same name is probably something to avoid in the future.)

(Side note #2: we should probably think more carefully about using variadic templates in our public API in the future; we currently use them pretty heavily, but they tend to be messy and hard to reason about, IMHO.)

* tidy

* remove underscores