Adjusting GetCurrentProcessorId caching to different environments.
#467
Conversation
Thanks @tannergooding for validating this on …
```csharp
double tlsMin = double.MaxValue;
for (int i = 0; i < CalibrationSamples; i++)
{
    idMin = Math.Min(idMin, calibrationState[i * 2]); // ID
```
Do we really need to keep the whole array around just to compute the minimums at the end?
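For illustration, the running-minimum alternative the question hints at might look roughly like this (a sketch, not the PR's actual code):

```csharp
using System;

// Sketch: instead of storing every (id, tls) sample in an array and scanning it
// at the end, keep only the running minimums as samples arrive.
internal static class RunningMinimumSketch
{
    private static double s_idMin  = double.MaxValue;
    private static double s_tlsMin = double.MaxValue;

    internal static void RecordSample(double idCost, double tlsCost)
    {
        s_idMin  = Math.Min(s_idMin, idCost);
        s_tlsMin = Math.Min(s_tlsMin, tlsCost);
    }
}
```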
cc @adamsitnik Any thoughts about the impact of calibration algorithms like this on benchmarking?
BenchmarkDotNet does not use … /cc @AndreyAkinshin
@adamsitnik - I think @jkotas is asking about the impact on benchmarking of the general practice of using calibration like this. Note that calibration adds about 5 msec total to the first 500 accesses (one sample taken per 50 accesses). Further accesses will not have that cost, but the results of calibration will affect the API by re-tuning the cache refresh rate from the default to, hopefully, a better value. There could be an observable change, and "better" is a bit of a fuzzy metric. Also, as long as we accept some degree of error (as I said, we can dial that against the cost of calibration), there could be some bimodality from run to run. Ex: in one run the cache refresh is set to 20 accesses, in another to 18. Any thoughts on whether this could be detrimental to benchmarking, and whether benchmarking will be capable of catching any misbehavior (i.e. long-term variability turns out way higher than we expected, or there are wild outliers)?
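As background for the numbers above, here is a minimal sketch of the kind of per-thread cache whose refresh rate the calibration re-tunes; names and constants are illustrative, not the PR's code:

```csharp
using System.Threading;

// Illustrative per-thread cache: the processor id is re-queried only every
// s_refreshRate accesses, and calibration's job is to pick a good value for
// s_refreshRate (e.g. the 18 vs. 20 accesses mentioned above).
internal static class CachedProcessorIdSketch
{
    [ThreadStatic] private static int t_cachedId;
    [ThreadStatic] private static int t_accessesUntilRefresh;

    private static int s_refreshRate = 50; // default; re-tuned by calibration

    public static int Get()
    {
        if (--t_accessesUntilRefresh <= 0)
        {
            // Refresh from the OS; in the real code a calibration sample may also be
            // taken here. Thread.GetCurrentProcessorId() stands in for the underlying
            // OS query (GetCurrentProcessorNumber / sched_getcpu).
            t_cachedId = Thread.GetCurrentProcessorId();
            t_accessesUntilRefresh = s_refreshRate;
        }

        return t_cachedId;
    }
}
```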
fixed some comments
…Check to ProcessorIDCache.cs
I am sorry, I misunderstood the question. Thanks for the clarification @VSadov! BenchmarkDotNet has a non-trivial warmup. Let's consider the following example, run on my Ubuntu PC using the latest .NET Core 5.0 SDK:

```csharp
[Benchmark]
public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
```

First of all, we run the benchmark once to JIT the code. If this takes longer than … If it takes less, we run the benchmark once with manual loop unrolling enabled (to JIT it again). Then we start the Pilot phase, which is supposed to find the perfect invocation count (how many times to run the benchmark per single iteration). (The perfect invocation count above is ….) After this, BDN runs the warmup phase, which has a simple heuristic that by default runs at least 6 iterations and stops when the results get stable. In the performance repo we run only 1 warmup iteration.

So the answer to your question is that the code is going to be executed so many times before we start the actual benchmarking that the end result won't contain the "warmup" and "calibration" overhead. However, if the calibration never stops, the benchmark results might be multimodal. BDN is going to print a warning and a nice histogram.
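For completeness, a self-contained version of that benchmark might look like the following; the wrapper class and entry point are my additions, not part of the original comment:

```csharp
using System.Threading;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ProcessorIdBenchmarks
{
    [Benchmark]
    public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
}

public class Program
{
    // Run with: dotnet run -c Release
    public static void Main(string[] args) => BenchmarkRunner.Run<ProcessorIdBenchmarks>();
}
```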
BTW, do we have any automated test for this API? Something that starts an affinitized process and then validates the value returned by the API? (I am just curious; I know that writing it and keeping it stable on all OSes might be non-trivial.)
The API is used in other features, so it is tested indirectly. The implementation is fairly simple, though.
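For reference, a minimal sketch of the kind of affinitized check the question describes; this is an assumption of how such a test could look, not an existing test, and setting ProcessorAffinity is not supported on every OS (it throws on macOS, for example):

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Pin the current process to a single CPU and check that GetCurrentProcessorId
// returns a stable value afterwards. A real test would need per-OS guards and
// should not assume the returned value is a 0-based CPU index.
public static class AffinitizedProcessorIdCheck
{
    public static void Run()
    {
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)1; // CPU 0 only

        int first = Thread.GetCurrentProcessorId();
        for (int i = 0; i < 1_000; i++)
        {
            if (Thread.GetCurrentProcessorId() != first)
                throw new Exception("Processor id changed while affinitized to a single CPU.");
        }
    }
}
```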
couple fixes
OSX does not implement …
Thanks!!!
```csharp
if (currentProcessorId < 0) currentProcessorId = Environment.CurrentManagedThreadId;

// Add offset to make it clear that it is not guaranteed to be 0-based processor number
currentProcessorId += 100;
```
It looks like the offset was removed in this cache refactoring PR. Are we going to guarantee a 0-based processor number or not? Whichever way we go, the docs ought to be updated to reflect it, and I'll happily PR the change. I think that if we're not going to guarantee it, the offset is a good idea and should be reintroduced. @VSadov @jkotas
We do not guarantee a 0-based processor number.
Whether or not to pay the extra cycle to add the offset is an interesting question. I do not have a strong opinion either way.
@CoffeeFlux - We cannot guarantee the [0..CpuCount) range. The underlying API may be nonfunctional and the managed thread ID is used instead. On VMs the core ID could be outside of this range too.
Why do you think adding the offset is a good idea, though?
It adds some cost to the API, while anyone doing something like GetCurrentProcessorId & mask will be tempted to subtract the offset.
My fear is customers relying on the behavior anyway and then ending up with Mono bug reports because we can't guarantee it on mobile, but maybe we should just update the docs to be unambiguous about this. My opinion isn't that strong; I just want it to be clarified somewhere other than comments in the source code that we can't guarantee a range of [0..CpuCount).
After sleeping on it, I think I'm okay just leaving the value as-is and updating the docs. I'll PR the change later today.
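For context on the GetCurrentProcessorId & mask pattern mentioned above, here is a sketch of a hypothetical striped-counter consumer; the type and names are illustrative, not taken from any real library:

```csharp
using System.Threading;

// Hypothetical consumer: a striped counter that masks the processor id into a
// power-of-two number of buckets. Callers like this want a cheap raw value,
// which is why a fixed "+100" offset tends to get subtracted again.
public sealed class StripedCounter
{
    private const int BucketCount = 64; // power of two so the id can be masked cheaply
    private readonly long[] _buckets = new long[BucketCount];

    public void Increment()
    {
        int index = Thread.GetCurrentProcessorId() & (BucketCount - 1);
        Interlocked.Increment(ref _buckets[index]);
    }

    public long Total()
    {
        long sum = 0;
        foreach (long bucket in _buckets) sum += bucket;
        return sum;
    }
}
```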
1. The timer workaround is no longer needed since EventPipe file polling was removed in dotnet/coreclr#24225.
2. dotnet/runtime#467 introduced a change that causes PMI non-determinism. System.Threading.Thread.s_isProcessorNumberReallyFast can have different values on two invocations of the process on the same machine (https://github.com/dotnet/runtime/blob/aedf8f52006619ef5d4eca65d79f42cc4b7bc402/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs#L502). That causes non-determinism in the generated code of methods inlining System.Threading.Thread.GetCurrentProcessorId() (https://github.com/dotnet/runtime/blob/aedf8f52006619ef5d4eca65d79f42cc4b7bc402/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs#L492-L498). The workaround is to set the value of System.Threading.Thread.s_isProcessorNumberReallyFast to true via reflection.
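A minimal sketch of the reflection workaround described above, assuming the private field keeps that name (it is an implementation detail and may not exist in other runtime versions):

```csharp
using System.Reflection;
using System.Threading;

// Force the private Thread.s_isProcessorNumberReallyFast flag to true so that
// code generation involving GetCurrentProcessorId is deterministic between runs.
internal static class PmiDeterminismWorkaround
{
    internal static void Apply()
    {
        FieldInfo field = typeof(Thread).GetField(
            "s_isProcessorNumberReallyFast",
            BindingFlags.NonPublic | BindingFlags.Static);

        field?.SetValue(null, true);
    }
}
```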
`GetProcessorNumber` has a big range of possible implementations and performance characteristics. The slowest implementations can be 100x slower than the fast ones, and we may not have seen the whole range. Even on the same hardware, performance may vary (e.g. an x86 WOW process will have noticeably slower support than x64 on the same machine/OS).

At sub-context-switch timescales the result of `GetProcessorNumber` is fairly stable. Most OSes try to keep a thread from "wandering" to other cores, so it is possible to amortize the expense of this API with thread-static caching. However, context switches do happen, and the cached result has diminishing utility as time passes.

The good news is that this API is fast on recent hardware/OSes and is getting faster. It is not uncommon now to see systems where the call is fast enough that caching is not necessary or may, in fact, make it slower. There are systems where this API is actually faster than a standalone `ThreadStatic` access (for example, when the RDPID instruction is available and the OS uses it).

The goal of this change is to use as little caching as possible while not falling into perf cliffs on systems where the API happens to be slow.

== In this change:

There are two on-demand checks that estimate the performance of `GetProcessorNumber` against a standalone `ThreadStatic` access used as a base reference. There is still a cache, but the caching strategy is adjusted according to those checks:

- On the first use of `GetCurrentProcessorId`, a quick and conservative check is performed to detect "definitely fast" systems. On such systems no caching is required and no further checks are necessary. We get out of the way; the cache is bypassed.
- At the time of a cache refresh, a calibration sample is collected. When enough samples are collected, a more precise refresh rate is computed. Typically it will be smaller than 50x; sometimes it may be larger.

The total cost of calibration has an upper bound and is not high. The impact is further mitigated by spreading out the measurements, which makes calibration pay-for-play, takes it away from start-up, and has a diluting effect. As a result, calibration is generally hardly noticeable in a profiler.
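A rough sketch of the calibration idea described above; all names, thresholds, and constants are illustrative, not the actual ProcessorIdCache implementation:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Time the processor-id query against a plain [ThreadStatic] read and use the
// ratio to decide whether to bypass the cache entirely or how often to refresh it.
internal static class CalibrationSketch
{
    [ThreadStatic] private static int t_dummy;

    internal static (bool bypassCache, int refreshRate) Calibrate()
    {
        // Thread.GetCurrentProcessorId() stands in here for the underlying OS query.
        long idTicks  = Measure(() => { _ = Thread.GetCurrentProcessorId(); });
        long tlsTicks = Measure(() => { _ = t_dummy; });

        double ratio = (double)idTicks / Math.Max(1, tlsTicks);

        if (ratio <= 1.5)                       // "definitely fast": caching would not help
            return (bypassCache: true, refreshRate: 1);

        // Otherwise refresh more often on fast systems, less often on slow ones.
        int refreshRate = (int)Math.Clamp(ratio * 5, 5, 5000);
        return (bypassCache: false, refreshRate);
    }

    private static long Measure(Action action)
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1_000; i++) action();
        return sw.ElapsedTicks;
    }
}
```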