[PERF] Turning on parallel init for AMDGPUs and enabling AMDGPU tests#3
[PERF] Turning on parallel init for AMDGPUs and enabling AMDGPU tests#3jamesETsmith merged 13 commits intoamd-integrationfrom
Conversation
This change also adds amdgpu to the list of backends supported by the test suite so it can be autodetected.
|
I'm working on one more tweak to the test script, I'll push that in the next hour |
These attributes were only set on AMDGPU_KERNEL functions, creating an attribute mismatch with internal runtime functions (like gpu_parallel_range_for). LLVM's inliner refuses to inline functions with incompatible target-specific attributes, which prevented the runtime functions from being inlined into kernels. Without inlining, InferAddressSpaces can't see the full pointer chain from kernel params to field data, so it can't promote flat pointers to global. This resulted in flat_* instructions everywhere instead of global_*, causing a ~4% throughput regression (301k vs 314k). Moving amdgpu-ieee and amdgpu-dx10-clamp to the all-functions loop makes the attributes compatible, enabling inlining and allowing InferAddressSpaces to promote to global_load/global_store/global_atomic. Made-with: Cursor
|
Do not merge this PR until #7 is merged |
|
Need to run pre-submit again |
|
/run-ci |
|
Please do not merge this PR @yaoliu13, more changes are incoming. |
…/jets/parallel_init
- MatrixPtrStmt byte-offset path: replace PtrToInt/IntToPtr with i8 GEP to preserve pointer provenance and address space for InferAddressSpaces. Fixes cross-scope matrix operation hangs. - optimized_reduction: force-inline runtime reduce_* callees so InferAddressSpaces can promote flat_atomic_cmpswap back to global_atomic_cmpswap after inlining. Fixes reduction test crashes caused by L1 cache coherency issues with flat atomics on MI300X. Made-with: Cursor
- runtime.cpp: Replace S_ENDPGM with __builtin_trap() in the AMDGPU assert handler. S_ENDPGM only kills the current wavefront, leaving other wavefronts spinning forever. __builtin_trap() emits s_trap 2 which halts the entire dispatch and returns hipErrorLaunchFailure to the host, preventing hangs in tests like test_ipow_negative_exp_i32. - test_element_wise.py: Loosen pow() tolerance to rel=1e-3 for the test_binary_f assertion. AMDGPU's __ocml_pow_f32 uses log2->mul->exp2 which gives ~0.06% relative error vs NumPy's x86 pow. Made-with: Cursor
|
Since these changes turned on a bunch of new tests, we needed new fixes. @deepsek did the first round of fixes in #7 and these fix some more. I'll check on the tests tomorrow, but my guess is a few more will need fixing AMDGPU Test Fixes Summary1. MatrixPtrStmt byte-offset GEPCross-scope matrix tests hang indefinitely (5 tests). The byte-offset path in Fix: Replace with File: 2. Direct LLVM atomics for reductionsAll reduction tests crash with SIGABRT or hang during compilation. The runtime Fix: Bypass the runtime helpers and emit direct LLVM atomics:
File: 3. Assert handler
|
…w tolerance - optimized_reduction: Replace runtime reduce_* helper calls with direct LLVM AtomicRMW / atomic_op_using_cas. The runtime helpers expect addrspace(0) pointers, requiring an addrspace cast + AlwaysInline for correctness, which caused 13+ minute compilation blowup. Direct atomics preserve the dest address space natively and compile fast. i32: AtomicRMW for add/min/max/and/or/xor f32: AtomicRMW FAdd for add, CAS loop for min/max - runtime.cpp: Replace S_ENDPGM with __builtin_trap() in the AMDGPU assert handler to halt the entire dispatch instead of just one wavefront. - test_element_wise.py: Loosen pow() tolerance to rel=1e-3 for AMDGPU __ocml_pow_f32 precision difference. Made-with: Cursor
|
I provided numbers in my previous comment |
Yea but those were before my changes which could have potentially affected the performance |
|
Several tests are still flaky. Since this PR is blocking other quadrants work like #9, I'm going to skip those tests (since the genesis tests still pass) and I'll revisit them later. |
Summary27 tests are excluded from AMDGPU runs via Root causes
Skipped testsAssert handler (16 tests)
|
|
/run-ci |
1 similar comment
|
/run-ci |
|
/run-ci |
|
Pre-submit throughput looks good |
Summary
Previously, only CUDA architectures would use the parallel init function for quadrants so AMDGPUs would default to the serial version which is orders of magnitude slower. It's a one time penalty, so it doesn't have a huge practical impact, but we shouldn't be handicapping ourselves even with the init functions like this.
After the initial changes in this PR, I also wanted to add some quality of life changes to the testing scripts and python test suite (allowing the test suite to detect and use amdgpu backend automatically).
Validation
Performance
As noted above, this changes doesn't affect the performance numbers on our benchmarks. So the throughput we're seeing stays the same