-
Notifications
You must be signed in to change notification settings - Fork 5.3k
Recover some lost EH performance on Unix amd64 #90033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Few months ago, a change to enable AVX512 on Unix was merged in. That change has enlarged the CONTEXT structure significantly and it was found that it has caused 6..17% regressions in exception handling microbenchmarks. This change gets some of the lost performance back by adding custom copy constructor and assignment operator to the CONTEXT structure and preventing copying of the AVX / AVX512 stuff if the source context doesn't have it. I have observed 3..9% gain on my machine. Close dotnet#84308
| } | ||
| else | ||
| { | ||
| copySize = offsetof(CONTEXT, Ymm0H); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't ymm only be there if we have the XSTATE?
That is, by default the CONTEXT struct only tracks the general purpose registers + the XMM_SAVE_AREA. AVX (which includes the upper-bits for the YMM state), AVX512_KMASK, AVX512_ZMM_H, and AVX512_ZMM are all separate flags for XSTATE.
For the proper CONTEXT in Windows, we'd normally get the ContextSize once up front (from InitializeContext) and then be able to simply memcpy using the returned contextSize. We'd then locate the subset of these feature areas when required via the LocateXStateFeature API and determine which were available via GetXStateFeaturesMask.
I wonder if modeling it more this way to match would make it cheaper/simpler (or potentially not having the PAL in the way for this perf sensitive scenario at all, particularly since there are far more optimal ways to represent this info if we did that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused - when XSTATE is not set, we copy everything upto to Ymm0H, so only the xmms are there.
Anyways, we should not copy the XStateFeaturesMask and XStateReserved0 in this case.
As for making the CONTEXT XState handling closer to how Windows does it (including the CopyContext / InitializeContext functions) might make sense, but I am not sure if it would really simplify anything.
In fact, we use the copying of the CONTEXT without using the real size at most places on Windows too (just assigning CONTEXT by value). I guess that those places just don't care about the xstate at all, so copying the base is sufficient. It just makes the target context incorrect (xstate indicating there is some xstate, but the copy not having any), so if some new code started to use such copies not having this on mind, subtle issues could occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am a bit confused - when XSTATE is not set, we copy everything upto to Ymm0H, so only the xmms are there.
Anyways, we should not copy the XStateFeaturesMask and XStateReserved0 in this case
Ah right, I misread this as copying including Ymm0H and not copy up to Ymm0H, my mistake
In fact, we use the copying of the CONTEXT without using the real size at most places on Windows too (just assigning CONTEXT by value). I guess that those places just don't care about the xstate at all, so copying the base is sufficient. It just makes the target context incorrect (xstate indicating there is some xstate, but the copy not having any), so if some new code started to use such copies not having this on mind, subtle issues could occur.
Right. It'd be nice if we could be more explicit about these things to help avoid such subtle issues (not necessarily in this PR, just in general for the future).
tannergooding
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left a comment about whether something was under the right condition or not and if we want to improve this more in the future.
| copySize = offsetof(CONTEXT, Ymm0H); | ||
| } | ||
|
|
||
| memcpy(this, &ctx, copySize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it have better performance characteristics to do memcpy with a constant in each of the arms above? (Alternatively, copy the first part unconditionally, then the additional parts only conditionally)? I have not looked at codegen, and maybe the C++ compilers already do this optimization, but memcpy with a constant length is usually handled much more efficiently by the compilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had it that way originally (memcpy in each of the arms), but the compiler actually did compute the size first and then called memcpy once. I liked that, so I've transformed the source to the same way.
The zeroing of CONTEXT structure is used at too many places and it is valid, so rather than trying to modify all the places by using placement new, just disable the warning for gcc.
|
|
||
| typedef struct DECLSPEC_ALIGN(16) _CONTEXT { | ||
|
|
||
| _CONTEXT() = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change goes against the spirit of the Win32 PAL. The Win32 PAL is meant to mirror the Windows SDK definitions. The Windows SDK definitions of this structure is POD struct without any C++. This makes it C++ struct and behave differently from the Windows SDK definitions that can lead to all sort of surprises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkotas the proper fix in the spirit of the Win32 PAL would be to implement CopyContext (and maybe InitializeContext too) and modify the whole codebase to use that instead of pure assignments of the CONTEXT structure. That sounded like too extensive change given the number of places we just assign CONTEXT structs by value in the runtime and the time I had left before the RC1 snapshot.
I'd be ok reverting this PR though and going for the cleaner solution if you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are ok with this, I am ok with it too. I wanted to point this difference out since it is likely going to take somebody by surprise at some point.
If you would like to clean it up, it can wait for .NET 9.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I'll leave this checked in and clean it up in 9 then.
Few months ago, a change to enable AVX512 on Unix was merged in. That change has enlarged the CONTEXT structure significantly and it was found that it has caused 6..17% regressions in exception handling microbenchmarks.
This change gets some of the lost performance back by adding custom copy constructor and assignment operator to the CONTEXT structure and preventing copying of the AVX / AVX512 stuff if the source context doesn't have it. I have observed 3..9% gain on my machine.
Close #84308