Conversation

VSadov commented Dec 3, 2019

GetProcessorNumber has a big range of possible implementations and performance characteristics. The slowest implementations can be 100x slower than the fast ones, and we may not have seen the full range yet. Even on the same hardware, performance may vary (e.g., an x86 WOW64 process will have noticeably slower support than an x64 process on the same machine/OS).

On time scales shorter than a context switch, the result of GetProcessorNumber is fairly stable; most OSes try to keep a thread from “wandering” to other cores. As such, it is possible to amortize the cost of this API with thread-static caching. However, context switches do happen, and the cached result has diminishing utility as time passes.
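For illustration, a minimal sketch of that amortization (hypothetical names, Windows-only P/Invoke for brevity; not the actual runtime code):

using System;
using System.Runtime.InteropServices;

// Minimal sketch: cache the processor number per thread and re-query the OS
// only every RefreshRate accesses, so a slow OS call is paid once per 50
// accesses instead of on every access.
static class CachedProcessorId
{
    private const int RefreshRate = 50;   // accesses between OS queries

    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentProcessorNumber();

    [ThreadStatic] private static int t_cachedId;
    [ThreadStatic] private static int t_accessesLeft;

    public static int Get()
    {
        if (--t_accessesLeft <= 0)
        {
            t_cachedId = (int)GetCurrentProcessorNumber();
            t_accessesLeft = RefreshRate;
        }
        return t_cachedId;
    }
}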

The good news is that this API is fast on recent hardware/OSes and is getting faster. It is not uncommon now to see systems where the call is fast enough that caching is not necessary and may, in fact, make it slower. There are even systems where this API is faster than a standalone ThreadStatic access (for example, when the RDPID instruction is available and the OS uses it).

The goal of this change is to use as little caching as possible while not falling off perf cliffs on systems where the API happens to be slow.

== In this change:

There are two on-demand checks that estimate the performance of GetProcessorNumber, with a standalone ThreadStatic access used as a baseline. There is still a cache, but the caching strategy is adjusted according to the results of those checks (a conceptual sketch follows the list below).

  1. On the first access to GetCurrentProcessorId, a quick and conservative check is performed to detect “definitely fast” systems.
    On such systems no caching is required and no further checks are necessary; we get out of the way and the cache is bypassed.
  2. Systems that do not pass the first check operate with default cache settings (refresh every 50 accesses).
    At the time of each cache refresh, a calibration sample is collected.
    When enough samples have been collected, a more precise refresh rate is computed. Typically it will be smaller than 50; sometimes it may be larger.
    The total cost of calibration has a modest upper bound. The impact is further mitigated by spreading out the measurements, which makes calibration pay-for-play, keeps it away from start-up, and dilutes its effect. As a result, calibration is generally hardly noticeable in a profiler.
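Conceptually, the calibration compares the cost of the OS call with the cost of a ThreadStatic access and derives a refresh rate from the ratio. A rough, simplified sketch (hypothetical names and constants, Windows-only P/Invoke for brevity; not the actual runtime code):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class ProcessorIdCalibration
{
    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentProcessorNumber();

    [ThreadStatic] private static int t_dummy;

    public static int ComputeRefreshRate(int iterations = 1000)
    {
        // Time the raw OS call.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) GetCurrentProcessorNumber();
        double idCost = (double)sw.ElapsedTicks / iterations;

        // Time a plain thread-static access as the baseline.
        sw.Restart();
        for (int i = 0; i < iterations; i++) t_dummy++;
        double tlsCost = (double)sw.ElapsedTicks / iterations;

        // If the OS call is about as cheap as a TLS read, caching is pointless
        // (rate 1 = query every time); otherwise refresh roughly every
        // idCost/tlsCost accesses, clamped to a sane range.
        return (int)Math.Clamp(idCost / Math.Max(tlsCost, 0.001), 1.0, 5000.0);
    }
}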

@VSadov VSadov requested review from jkotas and stephentoub December 3, 2019 01:35
VSadov commented Dec 3, 2019

Thanks @tannergooding for validating this on the Zen2 architecture (known to have RDPID support).

double idMin = double.MaxValue;
double tlsMin = double.MaxValue;
for (int i = 0; i < CalibrationSamples; i++)
{
    // Samples are interleaved: even slots hold ID timings, odd slots TLS timings.
    idMin = Math.Min(idMin, calibrationState[i * 2]);        // ID
    tlsMin = Math.Min(tlsMin, calibrationState[i * 2 + 1]);  // TLS
}
jkotas commented Dec 3, 2019

Do we really need to keep the whole array around to just compute the minimums at the end?

jkotas commented Dec 3, 2019

cc @adamsitnik Any thoughts about the impact of calibration algorithms like this on benchmarking?

adamsitnik commented:

> Any thoughts about the impact of calibration algorithms like this on benchmarking?

BenchmarkDotNet does not use GetCurrentProcessorId at all. All the heuristics are based on execution time only.

/cc @AndreyAkinshin


VSadov commented Dec 4, 2019

@adamsitnik - I think @jkotas is asking about the impact on benchmarking of using calibration like this as a general practice.

Note that calibration adds about 5 msec total to the first 500 accesses (one sample taken per 50 accesses, i.e., roughly 0.5 msec for each of the 10 samples).
That assumes we keep the current calibration durations; we can dial them as a cost vs. variability tradeoff.

Further accesses will not bear that cost, but the results of calibration will affect the API by re-tuning the cache refresh rate from the default to, hopefully, a better value. There could be an observable change, and "better" is a bit of a fuzzy metric.
I.e., the average access may become slightly more expensive while the precision of the API improves, with the indirect effect of fewer cache misses and less sharing. The impact of that will depend on the sensitivity of the use pattern.
Basically, there is a slight performance "step" after 500 accesses, when calibration is done.

Also, as long as we accept some degree of error (which, as I said, we can dial against the cost of calibration), there could be some bimodality from run to run. E.g., in one run the cache refresh is set to 20 accesses, in another to 18.
This can be observed indirectly as run-to-run noise, even though the effects are fairly indirect, unless the errors are extreme.

Any thoughts on whether this could be detrimental to benchmarking, and whether benchmarking would be capable of catching any misbehavior (e.g., long-term variability turning out much higher than we expected, or wild outliers)?

adamsitnik commented:

> I think @jkotas is asking about the impact on benchmarking of using calibration like this as a general practice.

I am sorry, I misunderstood the question. Thanks for the clarification, @VSadov!

BenchmarkDotNet has a non-trivial warmup. Let's consider the following example, run on my Ubuntu PC using the latest .NET Core 5.0 SDK:

[Benchmark]
public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
WorkloadJitting  1: 1 op, 337448.00 ns, 337.4480 us/op

WorkloadJitting  2: 16 op, 447148.00 ns, 27.9468 us/op

WorkloadPilot    1: 16 op, 931.00 ns, 58.1875 ns/op
WorkloadPilot    2: 4296464 op, 37519726.00 ns, 8.7327 ns/op
WorkloadPilot    3: 28628048 op, 216613893.00 ns, 7.5665 ns/op
WorkloadPilot    4: 33040416 op, 211301732.00 ns, 6.3953 ns/op
WorkloadPilot    5: 39091520 op, 250231990.00 ns, 6.4012 ns/op
WorkloadPilot    6: 39055280 op, 250907306.00 ns, 6.4244 ns/op
WorkloadPilot    7: 38914064 op, 248882611.00 ns, 6.3957 ns/op
WorkloadPilot    8: 39088784 op, 249162154.00 ns, 6.3743 ns/op
WorkloadPilot    9: 39220240 op, 251730406.00 ns, 6.4184 ns/op

WorkloadWarmup   1: 39220240 op, 248808770.00 ns, 6.3439 ns/op

First of all, we run the benchmark once to JIT the code. If this takes longer than IterationTime (500 ms by default, 250 ms in the performance repo), the warmup is over (to save time for long-running benchmarks) and we run the benchmark once per iteration.

WorkloadJitting  1: 1 op, 337448.00 ns, 337.4480 us/op

If it takes less, we run the benchmark once with manual loop unrolling enabled (to JIT it again).

WorkloadJitting  2: 16 op, 447148.00 ns, 27.9468 us/op

Then we start the Pilot phase, which is supposed to find the perfect invocation count (how many times to run the benchmark per single iteration):

WorkloadPilot    1: 16 op, 931.00 ns, 58.1875 ns/op
WorkloadPilot    2: 4296464 op, 37519726.00 ns, 8.7327 ns/op
WorkloadPilot    3: 28628048 op, 216613893.00 ns, 7.5665 ns/op
WorkloadPilot    4: 33040416 op, 211301732.00 ns, 6.3953 ns/op
WorkloadPilot    5: 39091520 op, 250231990.00 ns, 6.4012 ns/op
WorkloadPilot    6: 39055280 op, 250907306.00 ns, 6.4244 ns/op
WorkloadPilot    7: 38914064 op, 248882611.00 ns, 6.3957 ns/op
WorkloadPilot    8: 39088784 op, 249162154.00 ns, 6.3743 ns/op
WorkloadPilot    9: 39220240 op, 251730406.00 ns, 6.4184 ns/op

(The perfect invocation count above is 39220240)

After this, BDN runs the warmup phase, which has a simple heuristic: by default it runs at least 6 iterations and stops when the results get stable. In the performance repo we run only 1 warmup iteration.

WorkloadWarmup   1: 39220240 op, 248808770.00 ns, 6.3439 ns/op

So the answer to your question is that the code is going to be executed so many times before the actual benchmarking starts that the end result won't contain the "warmup" and "calibration" overhead.

However, if the calibration never stops, the benchmark results might be multimodal. In that case BDN will print a warning and a nice histogram.

adamsitnik commented:

BTW, do we have any automated test for this API? Something that starts an affinitized process and then validates the value returned by the API? (I am just curious; I know that writing it and keeping it stable across all OSes might be non-trivial.)
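A minimal sketch of such a test (hypothetical; uses process affinity, which is supported on Windows and Linux but not everywhere, and can only assert weak properties because the returned id is not guaranteed to be the 0-based processor number):

using System;
using System.Diagnostics;
using System.Threading;

class ProcessorIdAffinityTest
{
    static int Main()
    {
        // Pin the whole process to CPU 0 (supported on Windows and Linux).
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)1;
        Thread.Sleep(10); // give the scheduler a moment to migrate the thread

        // The id is not guaranteed to be 0, only to be stable while the
        // thread cannot migrate, so assert stability rather than the value.
        int first = Thread.GetCurrentProcessorId();
        for (int i = 0; i < 1000; i++)
        {
            if (Thread.GetCurrentProcessorId() != first)
                return 1; // fail: id changed while affinitized to one core
        }
        return 0; // pass
    }
}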


VSadov commented Dec 14, 2019

The API is used by other features, so it is tested indirectly. The implementation is fairly simple, though.




VSadov commented Dec 15, 2019

OSX does not implement GetCurrentProcessorNumber?

@VSadov VSadov merged commit aedf8f5 into dotnet:master Dec 15, 2019
@VSadov VSadov deleted the CoreId branch December 15, 2019 19:43

VSadov commented Dec 15, 2019

Thanks!!!

// The OS-specific call reported failure or is unsupported; fall back to the managed thread id.
if (currentProcessorId < 0) currentProcessorId = Environment.CurrentManagedThreadId;

// Add an offset to make it clear that the result is not guaranteed to be a 0-based processor number.
currentProcessorId += 100;
CoffeeFlux commented Jan 13, 2020

It looks like the offset was removed in this cache-refactoring PR. Are we going to guarantee a 0-based processor number or not? Whichever way we go, the docs ought to be updated to reflect it, and I'll happily PR the change. If we're not going to guarantee it, I think the offset is a good idea and should be reintroduced. @VSadov @jkotas

Member commented:

We do not guarantee a 0-based processor number.

Whether or not to pay the extra cycle to add the offset is an interesting question. I do not have a strong opinion either way.

VSadov commented:

@CoffeeFlux - We cannot guarantee the [0..CpuCount) range. The underlying API may be nonfunctional, in which case the managed thread ID is used instead. On VMs the core ID could be outside that range too.

Why do you think adding the offset is a good idea, though?
It adds some cost to the API, while anyone doing something like GetCurrentProcessorId & mask will be tempted to subtract the offset.
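For context, a typical consumer looks something like this hypothetical striped counter; since the range is not guaranteed, callers reduce the raw id into their bucket range rather than indexing with it directly:

using System;
using System.Threading;

sealed class StripedCounter
{
    private readonly long[] _counts = new long[Environment.ProcessorCount * 2];

    public void Increment()
    {
        // The raw id may exceed CpuCount (VMs, fallback to managed thread id),
        // so fold it into range; & works only when the length is a power of two.
        int bucket = (int)((uint)Thread.GetCurrentProcessorId() % (uint)_counts.Length);
        Interlocked.Increment(ref _counts[bucket]);
    }

    public long Total()
    {
        long total = 0;
        foreach (long c in _counts) total += c;
        return total;
    }
}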

CoffeeFlux commented Jan 13, 2020

My fear is that customers will rely on the behavior anyway and we'll end up with Mono bug reports, because we can't guarantee it on mobile; but maybe we should just update the docs to be unambiguous about this. My opinion isn't that strong; I just want it clarified somewhere other than comments in the source code that we can't guarantee a range of [0..CpuCount).

CoffeeFlux commented:

After sleeping on it, I think I'm okay just leaving the value as-is and updating the docs. I'll PR the change later today.

erozenfeld added a commit to erozenfeld/jitutils that referenced this pull request Mar 20, 2020
1. The timer workaround is no longer needed since EventPipe file polling
was removed in dotnet/coreclr#24225.

2. dotnet/runtime#467 introduced a change that causes PMI non-determinism.
System.Threading.Thread.s_isProcessorNumberReallyFast can have different
values on two invocations of the process on the same machine.
(https://github.com/dotnet/runtime/blob/aedf8f52006619ef5d4eca65d79f42cc4b7bc402/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs#L502)
That causes non-determinism in the code generated for methods inlining
System.Threading.Thread.GetCurrentProcessorId()
(https://github.com/dotnet/runtime/blob/aedf8f52006619ef5d4eca65d79f42cc4b7bc402/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs#L492-L498).
The workaround is to set System.Threading.Thread.s_isProcessorNumberReallyFast
to true via reflection.
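A minimal sketch of that reflection workaround (the field name comes from the commit message above; the actual jitutils change may differ):

using System.Reflection;
using System.Threading;

// Force the "really fast" flag to the same value on every run, so the code
// generated for inlined GetCurrentProcessorId calls is deterministic.
typeof(Thread)
    .GetField("s_isProcessorNumberReallyFast", BindingFlags.NonPublic | BindingFlags.Static)
    ?.SetValue(null, true);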
erozenfeld added a commit to dotnet/jitutils that referenced this pull request Mar 20, 2020, with the same commit message as above.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020