Fix out of bounds reads in strided ARM loads #5784
|
Looks clean. Ready to land? |
|
We discussed emitting a warning whenever we no longer generate vldN where we would have before this change. However, I now think we should not do this, because I found many cases where the warning would be emitted yet performance is better or unchanged. Overall, the performance impact is a lot more mixed than I expected: there are a few extreme cases where performance is up to 2x worse, but most of the time the impact is negligible, and it is often an improvement. |
|
It's surprising that there are cases where performance is better. Have you looked at the asm to see why? |
|
I think those cases are mostly vld2, and the base class generates reasonable code for those (2 loads + shuffle). |
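For readers following along, here is a minimal scalar model of the "2 loads + shuffle" pattern for a stride-2 load. It is only a sketch: `kLanes`, `Vec`, `dense_load`, and `strided2_load` are hypothetical names for illustration, not Halide or LLVM APIs, and the real code generator emits vector instructions rather than loops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical vector width, chosen only for illustration.
constexpr std::size_t kLanes = 4;
using Vec = std::array<int, kLanes>;

// Models a dense vector load of kLanes contiguous elements.
// The assert stands in for the fault an out-of-bounds read could cause.
Vec dense_load(const std::vector<int> &buf, std::size_t start) {
    assert(start + kLanes <= buf.size());
    Vec v{};
    for (std::size_t i = 0; i < kLanes; i++) {
        v[i] = buf[start + i];
    }
    return v;
}

// Models a stride-2 load, out[i] = buf[base + 2 * i], as two dense
// loads followed by a shuffle that keeps every other element.
Vec strided2_load(const std::vector<int> &buf, std::size_t base) {
    const Vec lo = dense_load(buf, base);           // base .. base + kLanes - 1
    const Vec hi = dense_load(buf, base + kLanes);  // base + kLanes .. base + 2 * kLanes - 1
    Vec out{};
    for (std::size_t i = 0; i < kLanes; i++) {
        const std::size_t j = 2 * i;  // index into concat(lo, hi)
        out[i] = j < kLanes ? lo[j] : hi[j - kLanes];
    }
    return out;
}
```

Note that the two dense loads touch `2 * kLanes` elements, while the strided access itself only needs elements up to `base + 2 * (kLanes - 1)`. In this model the second dense load therefore reads one element the strided access never asked for.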
|
I modified this to implement up to stride 4 in CodeGen_LLVM, using a combination of the old CodeGen_LLVM logic, and the new CodeGen_ARM logic. |
|
I'm seeing some good speed-ups on the packed cases in the resize app on x86 due to using dense loads and shuffles instead of gathers. |
|
|
|
I can't reproduce that and correctness_memoize looks completely unaffected by this change. |
|
I restarted that build; let's just see what happens. |
|
Oh, the same failure occurred on another unrelated PR: https://buildbot.halide-lang.org/master/#/builders/32/builds/16, as well as quite a few other issues. I think that is surely a flake. |
I wonder if #5780 could be causing the memoize issue? |
|
Dang, this was actually a major regression for strided loads from input buffers on x86. Now it's a vector gather instead of a load and shuffle. Not sure how we missed it. |
|
Basic 2x downsampling pipelines are a mess now. |
|
Plus as far as I can tell, this loads from before the start of external buffers due to the offset being applied unconditionally. |
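To make the concern concrete, the sketch below compares the index ranges touched by three ways of lowering a strided load. It is a simplified model under stated assumptions, not the code in this PR: the function names are hypothetical, and "shifted" assumes the dense loads are moved back by `stride - 1` elements to avoid reading past the end.

```cpp
#include <cassert>
#include <cstdint>

// Inclusive range of element indices touched by a load.
struct Footprint {
    std::int64_t min;
    std::int64_t max;
};

// Elements the strided access actually needs: base, base + stride, ...
Footprint exact_footprint(std::int64_t base, std::int64_t stride, std::int64_t lanes) {
    assert(stride >= 1 && lanes >= 1);
    return {base, base + stride * (lanes - 1)};
}

// Elements touched by `stride` dense loads of `lanes` elements starting at base.
Footprint dense_footprint(std::int64_t base, std::int64_t stride, std::int64_t lanes) {
    assert(stride >= 1 && lanes >= 1);
    return {base, base + stride * lanes - 1};
}

// The same dense loads moved back by stride - 1 elements (assumed shift),
// so the last element read is the last one the access needs.
Footprint shifted_footprint(std::int64_t base, std::int64_t stride, std::int64_t lanes) {
    const Footprint d = dense_footprint(base, stride, lanes);
    const std::int64_t shift = stride - 1;
    return {d.min - shift, d.max - shift};
}

// True if the footprint lies inside a buffer covering [0, extent).
bool in_bounds(Footprint f, std::int64_t extent) {
    return f.min >= 0 && f.max < extent;
}
```

With `base = 0`, `stride = 2`, `lanes = 4`, and a buffer of extent 7, the exact footprint `[0, 6]` is in bounds, the dense footprint `[0, 7]` reads past the end, and the shifted footprint `[-1, 6]` reads before the start. With `base = 1` and extent 8, the shifted footprint `[0, 7]` is safe while the unshifted one `[1, 8]` is not, which is why, in this model, the shift has to be conditional on where the load starts.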
This PR fixes long-standing out of bounds reads for strided loads. It combines some of the logic in CodeGen_LLVM (for stride 2 loads) with the logic in CodeGen_ARM (for strides up to 4), and implements the result in CodeGen_LLVM.
This will be a performance regression for code that uses strided loads from external buffers without sufficient alignment information to determine that the loads are safe. However, measurements on a variety of code show that the performance impact is mostly small, with improvements in some cases.