Hexagon: Added support for generating vgather instruction on versions >=v65 #1
pranavb-ca
left a comment
This is good stuff, and it now looks far better than before.
src/Expr.h
Outdated
    * "local" in OpenCL, and "threadgroup" in metal. Can be shared
    * across GPU threads within the same block. */
    GPUShared,
    Vtcm,
Minor nit: please consider naming this VTCM rather than Vtcm. Unlike Heap, Stack, Register, etc., it is an abbreviation.
Changed the name to VTCM.
src/CodeGen_Posix.cpp
Outdated
    const string str_max_size = target.has_large_buffers() ? "2^63 - 1" : "2^31 - 1";
    user_error << "Total size for allocation " << name << " is constant but exceeds " << str_max_size << ".";
    } else if (memory_type == MemoryType::Heap ||
    memory_type == MemoryType::Vtcm ||
VTCM is HVX-specific; please consider asserting on the target here.
Added an assert at the start of the function.
src/CodeGen_Posix.cpp
Outdated
    if (new_expr.defined()) {
    allocation.ptr = codegen(new_expr);
    } else {
    string malloc_nm = (memory_type == MemoryType::Vtcm) ?
Dillon may ask for the following. Don't make this change yet; I am only flagging the possibility.
Why not define new_expr as halide_vtcm_malloc and free_function as halide_vtcm_free in the Allocate node if the memory_type is VTCM?

    void visit(const Allocate *op) {
        internal_assert(!op->new_expr.defined() && op->free_function.empty())
            << "VTCM cannot have custom allocators and deallocators\n";
        new_expr = Call::make(Int(32), "halide_vtcm_malloc", size);
        free_function = "halide_vtcm_free";
    }

Yes, Pranav, that is a possibility. But if the user wants to use VTCM memory as a scratch buffer, it is better to keep the logic in CodeGen_Posix: HexagonOptimize deals mostly with Hexagon optimizations, and all the allocation logic already lives in create_allocation. I agree with your concern, though; Dillon might ask us to move this to the IRMutator.
I got that wrong last time. I only understood new_expr and free_function much later; I should have read the comment more carefully.
src/CodeGen_Hexagon.cpp
Outdated
    int intrin_lanes = native_vector_bits()/op->type.bits();
    // Cut up the indices into appropriately-sized pieces.
    vector<Value *> results;
    string suffix = (op->type.bits() == 16) ? ".h.h" : ".w.w";
Do you want to assert for other types?
Added an assert for 8 bits.
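The suffix selection discussed above can be sketched as a small standalone function. This is only an illustration: `vgather_suffix` is a hypothetical name, and rejecting every width other than 16 and 32 bits is an assumption (the thread only confirms that 8 bits is asserted against). An exception stands in for Halide's `internal_assert` so the sketch compiles on its own.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: maps an element width in bits to the vgather
// intrinsic suffix used in the hunk above.
// Assumption: only 16-bit and 32-bit elements are supported.
std::string vgather_suffix(int bits) {
    switch (bits) {
    case 16:
        return ".h.h";
    case 32:
        return ".w.w";
    default:
        // Stand-in for an internal_assert in the real code generator.
        throw std::invalid_argument("vgather: unsupported element width " +
                                    std::to_string(bits));
    }
}
```

Compared with the ternary in the hunk, an explicit switch makes an unexpected width fail loudly instead of silently falling through to `.w.w`.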
src/HexagonOptimize.cpp
Outdated
    // 1. out(x) = lut(foo(x)) -> vgather
    // 2. out(idx(x)) = foo(x) -> vscatter
    // For gathers out and lut should be in VTCM in a single page.
    class ScatterGatherGenerator : public IRMutator {
Please use IRMutator2. IRMutator is not preferred any more.
It was in the plan. The next commit should make this change too.
src/HexagonOptimize.cpp
Outdated
    Expr is_gather(const Load *op, const Expr dst_base, const Expr dst_index) {
    Type ty = op->type;
    const Allocate *alloc = allocations[op->name];
    if (op->index.as<Ramp>() || !alloc ||
Please consider breaking this up like so:

    if (!alloc || alloc->memory_type != MemoryType::Vtcm) return Expr();
    if (op->index.as<Ramp>()) return Expr();
    internal_assert(is_one(op->predicate) && ty.is_vector() && ty.bits() != 8);

I've moved (!alloc || alloc->memory_type != MemoryType::Vtcm) into a separate if block. Is an assert the right thing there? If the predicate is not 1, that is still not an error; we just don't have to do anything.
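The trade-off raised here (assert versus quietly bailing out) can be sketched with plain stand-in types. Everything below is hypothetical: `LoadInfo` and `is_gather_candidate` are illustrative names, not Halide API, and the sketch follows the reply's reading that a non-trivial predicate simply means "not a gather" rather than an internal error.

```cpp
#include <cassert>

enum class MemoryType { Heap, Stack, VTCM };

// Hypothetical stand-in for the properties of a Load that is_gather inspects.
struct LoadInfo {
    bool has_allocation;    // an Allocate node was found for the buffer
    MemoryType memory_type; // where that allocation lives
    bool index_is_ramp;     // dense (contiguous) index
    bool predicate_is_one;  // load is unpredicated
    bool is_vector;
    int bits;               // element width
};

// Returns true only when the load could be lowered to a gather.
// Each guard returns early instead of asserting, so unsupported
// loads are left untouched rather than treated as errors.
bool is_gather_candidate(const LoadInfo &l) {
    if (!l.has_allocation || l.memory_type != MemoryType::VTCM) return false;
    if (l.index_is_ramp) return false;  // a dense load needs no gather
    if (!l.predicate_is_one || !l.is_vector || l.bits == 8) return false;
    return true;
}
```

Splitting the guards this way keeps each rejection reason on its own line, which is the readability gain the review comment is after.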
src/HexagonOptimize.cpp
Outdated
    return Expr();
    }

    Expr index = mutate(ty.bits()/8 * op->index);
src/HexagonOptimize.cpp
Outdated
    Expr new_index = mutate(cast(ty, index));

    return Call::make(ty, "gather", {dst_base, dst_index, src, size-1, new_index},
    Call::PureIntrinsic);
Is PureIntrinsic correct? gather does affect memory and represents a scheduling barrier, right?
Changing to Intrinsic, although we use setDoesNotAccessMemory() for both Intrinsic and PureIntrinsic.
src/runtime/hvx_128.ll
Outdated
    define weak_odr <64 x i16> @halide.hexagon.vgather.h.h(i16* %dst_base_ptr, i32 %dst_index, i16* %src_16ptr, i32 %size, <64 x i16> %lut) nounwind uwtable {
    %lut32 = bitcast <64 x i16> %lut to <32 x i32>
    %src = ptrtoint i16* %src_16ptr to i32
    %dst_base = ptrtoint i16* %dst_base_ptr to i32
We should consider using getelementptr for pointer arithmetic instead of ptrtoint. From my days of working on LLVM, I remember that ptrtoint and inttoptr can throw off some optimizations. For now, don't make any changes until I talk to Krzysztof tomorrow to confirm this.
Hi Pranav, I was also wondering whether using GEP is wiser. Should I change this to GEP?
Thanks a lot, Pranav. Changing this to GEP.
pranavb-ca
left a comment
Reviewed the sim changes this time.
    #include <stdlib.h>

    bool vtcm_ready = false;
    const unsigned int TCM_BASE = 0xD800 << 16;
Did you get these constant values from the example that I had shared with you? Do we know it is OK to use these for our case?
Yes, I borrowed the values from the example. It would be better to know where these numbers come from. Eric did not reply to the last email, so I'll send it again to confirm this.
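For reference, the arithmetic behind `TCM_BASE` can be checked at compile time. The sketch below only restates the numbers visible in this file (`0xD800 << 16` and the `256KB` page-size comment); `kTcmBase`, `kPageBytes`, and `in_tcm_page` are hypothetical names, and treating the page as starting exactly at `TCM_BASE` is an assumption that the thread leaves unconfirmed.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the simulator source under review; their meaning on
// real hardware is still unconfirmed in this thread.
constexpr std::uint32_t kTcmBase = std::uint32_t{0xD800} << 16;
constexpr std::uint32_t kPageBytes = 256u * 1024u;  // "256KB" per the comment

static_assert(kTcmBase == 0xD8000000u, "0xD800 << 16 is 0xD8000000");

// Hypothetical helper: true if [addr, addr + len) lies inside a single page
// that is assumed to start at kTcmBase. Written to avoid unsigned overflow.
constexpr bool in_tcm_page(std::uint32_t addr, std::uint32_t len) {
    return addr >= kTcmBase && len <= kPageBytes &&
           addr - kTcmBase <= kPageBytes - len;
}
```

Using an unsigned operand for the shift keeps the result well defined; the original expression shifts a signed `int` literal into the sign bit before converting it.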
    unsigned int aa = 0;
    unsigned int vg = 3; // Set valid and ignore asid
    unsigned int page_size = 8; // 256KB
    add_translation_extended(1, va, pa, page_size, xwru, cccc, asid, aa, vg);
From line 113 to 116, please add comments explaining what is going on.
I have not added comments to the whole file yet. You'll find them in the next commit.
    Node* prev = NULL;
    Node* curr = list;
    while (curr) {
    if (list->addr == addr) {
Shouldn't this be if (curr->addr == addr)?
Yes, Pranav, this should be curr. Thanks for the catch.
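The fix agreed on here matters because comparing `list->addr` only ever inspects the head node, so the loop can never match any other entry. A minimal sketch of the corrected traversal follows; only the `addr` field and the `prev`/`curr` pattern come from the hunk, while `remove_by_addr` and the `size` and `next` fields are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Assumed node layout; only `addr` is visible in the reviewed hunk.
struct Node {
    unsigned addr;
    unsigned size;
    Node *next;
};

// Unlinks and returns the first node whose addr matches, or nullptr if
// there is none. The comparison uses `curr`, the node being visited,
// not `list`, which always points at the head.
Node *remove_by_addr(Node *&list, unsigned addr) {
    Node *prev = nullptr;
    Node *curr = list;
    while (curr) {
        if (curr->addr == addr) {
            if (prev) {
                prev->next = curr->next;
            } else {
                list = curr->next;  // removing the head
            }
            curr->next = nullptr;
            return curr;
        }
        prev = curr;
        curr = curr->next;
    }
    return nullptr;
}
```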
    }

    // Add and merge new node to list in sorted order.
    void addAndMerge(Node* &list, Node* a) {
Minor nit: it looks like addAndMerge is only needed for free_blocks. If so, why take a list argument? This is a minor point only, though.
I used this to give the flexibility to call addAndMerge on the used list as well if needed in the future. I'll remove the argument and use the global variable directly.
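For readers unfamiliar with the pattern, "add and merge in sorted order" on a free list can be sketched as below. This is not the code from the PR: `FreeBlock`, `add_and_merge`, and the `size` field are hypothetical, nodes are assumed to be heap-allocated with `new`, and the list is kept as an explicit parameter (the version discussed before the reply above) so the sketch stays self-contained.

```cpp
#include <cassert>

// Assumed free-list node: a block of `size` bytes starting at `addr`.
struct FreeBlock {
    unsigned addr;
    unsigned size;
    FreeBlock *next;
};

// Inserts `a` into the address-sorted list, then coalesces it with any
// neighbour it touches. Assumes blocks never overlap and that merged-away
// nodes were allocated with `new`.
void add_and_merge(FreeBlock *&list, FreeBlock *a) {
    FreeBlock *prev = nullptr;
    FreeBlock *curr = list;
    while (curr && curr->addr < a->addr) {
        prev = curr;
        curr = curr->next;
    }
    // Link `a` between prev and curr.
    a->next = curr;
    if (prev) {
        prev->next = a;
    } else {
        list = a;
    }
    // Merge with the following block if `a` ends where it begins.
    if (curr && a->addr + a->size == curr->addr) {
        a->size += curr->size;
        a->next = curr->next;
        delete curr;
    }
    // Merge with the preceding block if it ends where `a` begins.
    if (prev && prev->addr + prev->size == a->addr) {
        prev->size += a->size;
        prev->next = a->next;
        delete a;
    }
}
```

Merging with the successor first means the predecessor merge can absorb the already-combined block in one step.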
aankit-ca
left a comment
@pranavb-ca Hi Pranav, I've made changes as you requested. Please let me know if more changes are required.
…onal comments, minor changes
…Hexagon Modified gather correctness test
aa221c9 to 227bac4
2. Better failure message for using VTCM without v65.
* Let lerp lowering incorporate a final cast

This lets it save a few instructions on x86 and arm. cast(UInt(16), lerp(some_u8s)) produces the following, before and after this PR

Before:

x86:

    vmovdqu (%r15,%r13), %xmm4
    vpmovzxbw -2(%r15,%r13), %ymm5
    vpxor %xmm0, %xmm4, %xmm6
    vpmovzxbw %xmm6, %ymm6
    vpmovzxbw -1(%r15,%r13), %ymm7
    vpmullw %ymm6, %ymm5, %ymm5
    vpmovzxbw %xmm4, %ymm4
    vpmullw %ymm4, %ymm7, %ymm4
    vpaddw %ymm4, %ymm5, %ymm4
    vpaddw %ymm1, %ymm4, %ymm4
    vpmulhuw %ymm2, %ymm4, %ymm4
    vpsrlw $7, %ymm4, %ymm4
    vpand %ymm3, %ymm4, %ymm4
    vmovdqu %ymm4, (%rbx,%r13,2)
    addq $16, %r13
    decq %r10
    jne .LBB0_10

arm:

    ldr q0, [x17]
    ldur q2, [x17, #-1]
    ldur q1, [x17, #-2]
    subs x0, x0, #1 // =1
    mvn v3.16b, v0.16b
    umull v4.8h, v2.8b, v0.8b
    umull2 v0.8h, v2.16b, v0.16b
    umlal v4.8h, v1.8b, v3.8b
    umlal2 v0.8h, v1.16b, v3.16b
    urshr v1.8h, v4.8h, #8
    urshr v2.8h, v0.8h, #8
    raddhn v1.8b, v1.8h, v4.8h
    raddhn v0.8b, v2.8h, v0.8h
    ushll v0.8h, v0.8b, #0
    ushll v1.8h, v1.8b, #0
    add x17, x17, #16 // =16
    stp q1, q0, [x18, #-16]
    add x18, x18, #32 // =32
    b.ne .LBB0_10

After:

x86:

    vpmovzxbw -2(%r15,%r13), %ymm3
    vmovdqu (%r15,%r13), %xmm4
    vpxor %xmm0, %xmm4, %xmm5
    vpmovzxbw %xmm5, %ymm5
    vpmullw %ymm5, %ymm3, %ymm3
    vpmovzxbw -1(%r15,%r13), %ymm5
    vpmovzxbw %xmm4, %ymm4
    vpmullw %ymm4, %ymm5, %ymm4
    vpaddw %ymm4, %ymm3, %ymm3
    vpaddw %ymm1, %ymm3, %ymm3
    vpmulhuw %ymm2, %ymm3, %ymm3
    vpsrlw $7, %ymm3, %ymm3
    vmovdqu %ymm3, (%rbp,%r13,2)
    addq $16, %r13
    decq %r10
    jne .LBB0_10

arm:

    ldr q0, [x17]
    ldur q2, [x17, #-1]
    ldur q1, [x17, #-2]
    subs x0, x0, #1 // =1
    mvn v3.16b, v0.16b
    umull v4.8h, v2.8b, v0.8b
    umull2 v0.8h, v2.16b, v0.16b
    umlal v4.8h, v1.8b, v3.8b
    umlal2 v0.8h, v1.16b, v3.16b
    ursra v4.8h, v4.8h, #8
    ursra v0.8h, v0.8h, #8
    urshr v1.8h, v4.8h, #8
    urshr v0.8h, v0.8h, #8
    add x17, x17, #16 // =16
    stp q1, q0, [x18, #-16]
    add x18, x18, #32 // =32
    b.ne .LBB0_10

So on X86 we skip a pointless and instruction, and on ARM we get a rounding add and shift right instead of a rounding narrowing add shift right followed by a widen.

* Add test
* Fix bug in test
* Don't produce out-of-range lerp values
* add_requirement() maintenance

This PR started out as a quick fix to add Python bindings for the `add_requirements` methods on Pipeline and Generator (which were missing), but expanded a bit to fix other issues as well:

- The implementation of `Generator::add_requirement` was subtly wrong, in that it only worked if you called the method after everything else in your `generate()` method. Now we accumulate requirements and insert them at the end, so you can call the method anywhere.
- We had C++ methods that took both an explicit `vector<Expr>` and also a variadic-template version, but the former required a mutable vector... and fixing this to not require that ended up creating ambiguity about which overloaded call to use. Added an ugly enable_if thing to resolve this.

(Side note #1: overloading methods to have both templated and non-templated versions with the same name is probably something to avoid in the future.)

(Side note #2: we should probably think more carefully about using variadic templates in our public API in the future; we currently use it pretty heavily, but it tends to be messy and hard to reason about IMHO.)

* tidy
* remove underscores
No description provided.