Conversation

VSadov commented Dec 3, 2019

GetProcessorNumber has a big range of possible implementations and performance characteristics. The slowest implementations can be 100x slower than the fast ones, and we may not have seen the full range yet. Even on the same hardware, performance may vary (e.g., an x86 WOW64 process will have noticeably slower support than an x64 process on the same machine/OS).

On time scales shorter than a context switch, the result of GetProcessorNumber is fairly stable; most OSes try to keep a thread from “wandering” to other cores. As such, it is possible to amortize the cost of this API with thread-static caching. However, context switches do happen, and the cached result has diminishing utility as time passes.
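For illustration, a minimal sketch of that amortization (hypothetical names, Windows-only P/Invoke for brevity; not the actual runtime code):

using System;
using System.Runtime.InteropServices;

// Minimal sketch: cache the processor number per thread and re-query the OS
// only every RefreshRate accesses, so a slow OS call is paid once per 50
// accesses instead of on every access.
static class CachedProcessorId
{
    private const int RefreshRate = 50;   // accesses between OS queries

    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentProcessorNumber();

    [ThreadStatic] private static int t_cachedId;
    [ThreadStatic] private static int t_accessesLeft;

    public static int Get()
    {
        if (--t_accessesLeft <= 0)
        {
            t_cachedId = (int)GetCurrentProcessorNumber();
            t_accessesLeft = RefreshRate;
        }
        return t_cachedId;
    }
}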

The good news is that this API is fast on recent hardware/OSes and is getting faster. It is not uncommon now to see systems where the call is fast enough that caching is not necessary and may, in fact, make it slower. There are even systems where this API is faster than a standalone ThreadStatic access (for example, when the RDPID instruction is available and the OS uses it).

The goal of this change is to use as little caching as possible while not falling off perf cliffs on systems where the API happens to be slow.

== In this change:

There are two on-demand checks that estimate the performance of GetProcessorNumber, with a standalone ThreadStatic access used as a baseline. There is still a cache, but the caching strategy is adjusted according to the results of those checks (a conceptual sketch follows the list below).

  1. On the first access to GetCurrentProcessorId, a quick and conservative check is performed to detect “definitely fast” systems.
    On such systems no caching is required and no further checks are necessary; we get out of the way and the cache is bypassed.
  2. Systems that do not pass the first check operate with default cache settings (refresh every 50 accesses).
    At the time of each cache refresh, a calibration sample is collected.
    When enough samples have been collected, a more precise refresh rate is computed. Typically it will be smaller than 50; sometimes it may be larger.
    The total cost of calibration has a modest upper bound. The impact is further mitigated by spreading out the measurements, which makes calibration pay-for-play, keeps it away from start-up, and dilutes its effect. As a result, calibration is generally hardly noticeable in a profiler.
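Conceptually, the calibration compares the cost of the OS call with the cost of a ThreadStatic access and derives a refresh rate from the ratio. A rough, simplified sketch (hypothetical names and constants, Windows-only P/Invoke for brevity; not the actual runtime code):

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

static class ProcessorIdCalibration
{
    [DllImport("kernel32.dll")]
    private static extern uint GetCurrentProcessorNumber();

    [ThreadStatic] private static int t_dummy;

    public static int ComputeRefreshRate(int iterations = 1000)
    {
        // Time the raw OS call.
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) GetCurrentProcessorNumber();
        double idCost = (double)sw.ElapsedTicks / iterations;

        // Time a plain thread-static access as the baseline.
        sw.Restart();
        for (int i = 0; i < iterations; i++) t_dummy++;
        double tlsCost = (double)sw.ElapsedTicks / iterations;

        // If the OS call is about as cheap as a TLS read, caching is pointless
        // (rate 1 = query every time); otherwise refresh roughly every
        // idCost/tlsCost accesses, clamped to a sane range.
        return (int)Math.Clamp(idCost / Math.Max(tlsCost, 0.001), 1.0, 5000.0);
    }
}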

@VSadov VSadov requested review from jkotas and stephentoub December 3, 2019 01:35
VSadov commented Dec 3, 2019

Thanks @tannergooding for validating this on the Zen2 architecture (known to have RDPID support).

double idMin = double.MaxValue;
double tlsMin = double.MaxValue;
for (int i = 0; i < CalibrationSamples; i++)
{
    // Samples are interleaved: even slots hold ID timings, odd slots TLS timings.
    idMin = Math.Min(idMin, calibrationState[i * 2]);        // ID
    tlsMin = Math.Min(tlsMin, calibrationState[i * 2 + 1]);  // TLS
}
jkotas commented Dec 3, 2019

Do we really need to keep the whole array around to just compute the minimums at the end?

jkotas commented Dec 3, 2019

cc @adamsitnik Any thoughts about the impact of calibration algorithms like this on benchmarking?

adamsitnik commented:

> Any thoughts about the impact of calibration algorithms like this on benchmarking?

BenchmarkDotNet does not use GetCurrentProcessorId at all. All the heuristics are based on execution time only.

/cc @AndreyAkinshin


VSadov commented Dec 4, 2019

@adamsitnik - I think @jkotas is asking about the impact on benchmarking of using calibration like this as a general practice.

Note that calibration adds about 5 msec total to the first 500 accesses (one sample taken per 50 accesses, i.e., roughly 0.5 msec for each of the 10 samples).
That assumes we keep the current calibration durations; we can dial them as a cost vs. variability tradeoff.

Further accesses will not bear that cost, but the results of calibration will affect the API by re-tuning the cache refresh rate from the default to, hopefully, a better value. There could be an observable change, and "better" is a bit of a fuzzy metric.
I.e., the average access may become slightly more expensive while the precision of the API improves, with the indirect effect of fewer cache misses and less sharing. The impact of that will depend on the sensitivity of the use pattern.
Basically, there is a slight performance "step" after 500 accesses, when calibration is done.

Also, as long as we accept some degree of error (which, as I said, we can dial against the cost of calibration), there could be some bimodality from run to run. E.g., in one run the cache refresh is set to 20 accesses, in another to 18.
This can be observed indirectly as run-to-run noise, even though the effects are fairly indirect, unless the errors are extreme.

Any thoughts on whether this could be detrimental to benchmarking, and whether benchmarking would be capable of catching any misbehavior (e.g., long-term variability turning out much higher than we expected, or wild outliers)?

adamsitnik commented:

> I think @jkotas is asking about the impact on benchmarking of using calibration like this as a general practice.

I am sorry, I misunderstood the question. Thanks for the clarification, @VSadov!

BenchmarkDotNet has a non-trivial warmup. Let's consider the following example, run on my Ubuntu PC using the latest .NET Core 5.0 SDK:

[Benchmark]
public int GetCurrentProcessorId() => Thread.GetCurrentProcessorId();
WorkloadJitting  1: 1 op, 337448.00 ns, 337.4480 us/op

WorkloadJitting  2: 16 op, 447148.00 ns, 27.9468 us/op

WorkloadPilot    1: 16 op, 931.00 ns, 58.1875 ns/op
WorkloadPilot    2: 4296464 op, 37519726.00 ns, 8.7327 ns/op
WorkloadPilot    3: 28628048 op, 216613893.00 ns, 7.5665 ns/op
WorkloadPilot    4: 33040416 op, 211301732.00 ns, 6.3953 ns/op
WorkloadPilot    5: 39091520 op, 250231990.00 ns, 6.4012 ns/op
WorkloadPilot    6: 39055280 op, 250907306.00 ns, 6.4244 ns/op
WorkloadPilot    7: 38914064 op, 248882611.00 ns, 6.3957 ns/op
WorkloadPilot    8: 39088784 op, 249162154.00 ns, 6.3743 ns/op
WorkloadPilot    9: 39220240 op, 251730406.00 ns, 6.4184 ns/op

WorkloadWarmup   1: 39220240 op, 248808770.00 ns, 6.3439 ns/op

First of all, we run the benchmark once to JIT the code. If this takes longer than IterationTime (500 ms by default, 250 ms in the performance repo), the warmup is over (to save time for long-running benchmarks) and we run the benchmark once per iteration.

WorkloadJitting  1: 1 op, 337448.00 ns, 337.4480 us/op

If it takes less, we run the benchmark once with manual loop unrolling enabled (to JIT it again).

WorkloadJitting  2: 16 op, 447148.00 ns, 27.9468 us/op

Then we start the Pilot phase, which is supposed to find the perfect invocation count (how many times to run the benchmark per single iteration):

WorkloadPilot    1: 16 op, 931.00 ns, 58.1875 ns/op
WorkloadPilot    2: 4296464 op, 37519726.00 ns, 8.7327 ns/op
WorkloadPilot    3: 28628048 op, 216613893.00 ns, 7.5665 ns/op
WorkloadPilot    4: 33040416 op, 211301732.00 ns, 6.3953 ns/op
WorkloadPilot    5: 39091520 op, 250231990.00 ns, 6.4012 ns/op
WorkloadPilot    6: 39055280 op, 250907306.00 ns, 6.4244 ns/op
WorkloadPilot    7: 38914064 op, 248882611.00 ns, 6.3957 ns/op
WorkloadPilot    8: 39088784 op, 249162154.00 ns, 6.3743 ns/op
WorkloadPilot    9: 39220240 op, 251730406.00 ns, 6.4184 ns/op

(The perfect invocation count above is 39220240)

After this, BDN runs the warmup phase, which has a simple heuristic: by default it runs at least 6 iterations and stops when the results get stable. In the performance repo we run only 1 warmup iteration.

WorkloadWarmup   1: 39220240 op, 248808770.00 ns, 6.3439 ns/op

So the answer to your question is that the code is going to be executed so many times before the actual benchmarking starts that the end result won't contain the "warmup" and "calibration" overhead.

However, if the calibration never stops, the benchmark results might be multimodal. In that case BDN will print a warning and a nice histogram.

adamsitnik commented:

BTW, do we have any automated test for this API? Something that starts an affinitized process and then validates the value returned by the API? (I am just curious; I know that writing it and keeping it stable across all OSes might be non-trivial.)
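A minimal sketch of such a test (hypothetical; uses process affinity, which is supported on Windows and Linux but not everywhere, and can only assert weak properties because the returned id is not guaranteed to be the 0-based processor number):

using System;
using System.Diagnostics;
using System.Threading;

class ProcessorIdAffinityTest
{
    static int Main()
    {
        // Pin the whole process to CPU 0 (supported on Windows and Linux).
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)1;
        Thread.Sleep(10); // give the scheduler a moment to migrate the thread

        // The id is not guaranteed to be 0, only to be stable while the
        // thread cannot migrate, so assert stability rather than the value.
        int first = Thread.GetCurrentProcessorId();
        for (int i = 0; i < 1000; i++)
        {
            if (Thread.GetCurrentProcessorId() != first)
                return 1; // fail: id changed while affinitized to one core
        }
        return 0; // pass
    }
}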


VSadov commented Dec 14, 2019

The API is used by other features, so it is tested indirectly. The implementation is fairly simple, though.




VSadov commented Dec 15, 2019

OSX does not implement GetCurrentProcessorNumber?

@VSadov VSadov merged commit aedf8f5 into dotnet:master Dec 15, 2019
@VSadov VSadov deleted the CoreId branch December 15, 2019 19:43

VSadov commented Dec 15, 2019

Thanks!!!

// The OS-specific call reported failure or is unsupported; fall back to the managed thread id.
if (currentProcessorId < 0) currentProcessorId = Environment.CurrentManagedThreadId;

// Add an offset to make it clear that the result is not guaranteed to be a 0-based processor number.
currentProcessorId += 100;
CoffeeFlux commented Jan 13, 2020

It looks like the offset was removed in this cache-refactoring PR. Are we going to guarantee a 0-based processor number or not? Whichever way we go, the docs ought to be updated to reflect it, and I'll happily PR the change. If we're not going to guarantee it, I think the offset is a good idea and should be reintroduced. @VSadov @jkotas

Member commented:

We do not guarantee a 0-based processor number.

Whether or not to pay the extra cycle to add the offset is an interesting question. I do not have a strong opinion either way.

VSadov commented:

@CoffeeFlux - We cannot guarantee the [0..CpuCount) range. The underlying API may be nonfunctional, in which case the managed thread ID is used instead. On VMs the core ID could be outside that range too.

Why do you think adding the offset is a good idea, though?
It adds some cost to the API, while anyone doing something like GetCurrentProcessorId & mask will be tempted to subtract the offset.
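For context, a typical consumer looks something like this hypothetical striped counter; since the range is not guaranteed, callers reduce the raw id into their bucket range rather than indexing with it directly:

using System;
using System.Threading;

sealed class StripedCounter
{
    private readonly long[] _counts = new long[Environment.ProcessorCount * 2];

    public void Increment()
    {
        // The raw id may exceed CpuCount (VMs, fallback to managed thread id),
        // so fold it into range; & works only when the length is a power of two.
        int bucket = (int)((uint)Thread.GetCurrentProcessorId() % (uint)_counts.Length);
        Interlocked.Increment(ref _counts[bucket]);
    }

    public long Total()
    {
        long total = 0;
        foreach (long c in _counts) total += c;
        return total;
    }
}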

CoffeeFlux commented Jan 13, 2020

My fear is that customers will rely on the behavior anyway and we'll end up with Mono bug reports, because we can't guarantee it on mobile; but maybe we should just update the docs to be unambiguous about this. My opinion isn't that strong; I just want it clarified somewhere other than comments in the source code that we can't guarantee a range of [0..CpuCount).

CoffeeFlux commented:

After sleeping on it, I think I'm okay just leaving the value as-is and updating the docs. I'll PR the change later today.

erozenfeld added a commit to erozenfeld/jitutils that referenced this pull request Mar 20, 2020
1. The timer workaround is no longer needed since EventPipe file polling
was removed in dotnet/coreclr#24225.

2. dotnet/runtime#467 introduced a change that causes PMI non-determinism.
System.Threading.Thread.s_isProcessorNumberReallyFast can have different
values on two invocations of the process on the same machine.
(https://github.com/dotnet/runtime/blob/aedf8f52006619ef5d4eca65d79f42cc4b7bc402/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs#L502)
That causes non-determinism in the code generated for methods inlining
System.Threading.Thread.GetCurrentProcessorId()
(https://github.com/dotnet/runtime/blob/aedf8f52006619ef5d4eca65d79f42cc4b7bc402/src/coreclr/src/System.Private.CoreLib/src/System/Threading/Thread.CoreCLR.cs#L492-L498).
The workaround is to set System.Threading.Thread.s_isProcessorNumberReallyFast
to true via reflection.
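A minimal sketch of that reflection workaround (the field name comes from the commit message above; the actual jitutils change may differ):

using System.Reflection;
using System.Threading;

// Force the "really fast" flag to the same value on every run, so the code
// generated for inlined GetCurrentProcessorId calls is deterministic.
typeof(Thread)
    .GetField("s_isProcessorNumberReallyFast", BindingFlags.NonPublic | BindingFlags.Static)
    ?.SetValue(null, true);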
erozenfeld added a commit to dotnet/jitutils that referenced this pull request Mar 20, 2020, with the same commit message as above.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020