Skip to content

Fix out of bounds reads in strided ARM loads#5784

Merged
dsharletg merged 14 commits intomasterfrom
dsharletg/fix-arm-seg2
Mar 9, 2021
Merged

Fix out of bounds reads in strided ARM loads#5784
dsharletg merged 14 commits intomasterfrom
dsharletg/fix-arm-seg2

Conversation

@dsharletg
Copy link
Contributor

@dsharletg dsharletg commented Mar 3, 2021

This PR fixes long-standing out of bounds reads for strided loads. This combines some of the logic in CodeGen_LLVM (for stride 2 loads) with the logic in CodeGen_ARM (for stride up to 4), and implements it in CodeGen_LLVM.

This will be a performance regression for code that uses strided loads from external buffers without sufficient alignment information to determine the loads are safe. However, measurement on a variety of code shows the performance impact is mostly small or even improvements in some cases.

@steven-johnson
Copy link
Contributor

Looks clean. Ready to land?

@dsharletg
Copy link
Contributor Author

We discussed the possibility of emitting a warning when we do not generate vldN when we would have before this change. However, I now think we should not do this, because I found many cases where this warning would be emitted, yet performance is better or unchanged. Overall, the performance impact of this is actually a lot more mixed than I expected. There are a few extreme cases where performance is up to 2x worse, but most of the time, the impact is negligible and often an improvement.

@abadams
Copy link
Member

abadams commented Mar 8, 2021

It's surprising that there are cases where performance is better. Have you looked at the asm to see why?

@dsharletg
Copy link
Contributor Author

I think those cases are mostly vld2, and the base class generates reasonable code for those (2 loads + shuffle).

@dsharletg
Copy link
Contributor Author

I modified this to implement up to stride 4 in CodeGen_LLVM, using a combination of the old CodeGen_LLVM logic, and the new CodeGen_ARM logic.

@abadams
Copy link
Member

abadams commented Mar 9, 2021

I'm seeing some good speed-ups on the packed cases in the resize app on x86 due to using dense loads and shuffles instead of gathers.

@steven-johnson
Copy link
Contributor

correctness_memoize is failing on arm32.

@dsharletg
Copy link
Contributor Author

I can't reproduce that and correctness_memoize looks completely unaffected by this change.

@dsharletg
Copy link
Contributor Author

I restarted that build let's just see what happens.

@dsharletg
Copy link
Contributor Author

Oh, the same failure occurred on another unrelated PR: https://buildbot.halide-lang.org/master/#/builders/32/builds/16, as well as quite a few other issues.

I think that is surely a flake.

@steven-johnson
Copy link
Contributor

Oh, the same failure occurred on another unrelated PR: https://buildbot.halide-lang.org/master/#/builders/32/builds/16, as well as quite a few other issues.

I think that is surely a flake.

I wonder if #5780 could be causing the memoize issue?

@dsharletg dsharletg merged commit e75d9fb into master Mar 9, 2021
@dsharletg dsharletg deleted the dsharletg/fix-arm-seg2 branch March 9, 2021 22:16
dsharletg added a commit that referenced this pull request Mar 10, 2021
dsharletg added a commit that referenced this pull request Mar 10, 2021
dsharletg added a commit that referenced this pull request Mar 10, 2021
@alexreinking alexreinking added this to the v12.0.0 milestone May 19, 2021
@abadams
Copy link
Member

abadams commented Jun 22, 2022

Dang, this was actually a major regression for strided loads from input buffers on x86. Now it's a vector gather instead of a load and shuffle. Not sure how we missed it.

@abadams
Copy link
Member

abadams commented Jun 22, 2022

Basic 2x downsampling pipelines are a mess now.

@abadams
Copy link
Member

abadams commented Jun 22, 2022

Plus as far as I can tell, this loads from before the start of external buffers due to the offset being applied unconditionally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants