
Conversation

@VSadov (Member) commented Dec 5, 2019

GetCurrentProcessorNumber is capped to 64 on Windows, and that results in unexpected sharing when there are 64+ cores - in particular when the processor groups are in different NUMA nodes.
We need to use GetCurrentProcessorNumberEx instead.

GCToOSInterface::GetCurrentProcessorNumber has a separate implementation that looks correct; this is basically a short version of it.


#ifndef FEATURE_PAL
PROCESSOR_NUMBER proc_no_cpu_group;
GetCurrentProcessorNumberEx(&proc_no_cpu_group);
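For context, a minimal sketch of the overall shape being discussed here - folding the group and the within-group number into one integer - under the assumption that the fold is a simple shift-and-or (the actual FCALL may differ in details):

```cpp
#include <windows.h>

// Sketch only: return an identifier correlated with the processor the
// current thread last ran on, with the processor group folded in so that
// cores in different groups do not collide.
static int GetFlatProcessorId()
{
    PROCESSOR_NUMBER proc_no_cpu_group;
    GetCurrentProcessorNumberEx(&proc_no_cpu_group);

    // A processor group holds at most 64 logical processors, so 6 bits
    // are enough for the within-group number.
    return (proc_no_cpu_group.Group << 6) | proc_no_cpu_group.Number;
}
```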
Contributor commented:

Is it OK if ThreadNative::GetCurrentProcessorNumber returns an index larger than the total number of active processors on the system?

@VSadov (Member, Author) commented Dec 5, 2019

Yes. What we return to the user is technically an "ID correlated with the last core we ran on". We may even default to the ThreadID if the OS API is not functional (the PAL may return -1).
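
(A sketch of that fallback shape, purely illustrative; the -1 check stands in for "OS API is not functional" and is not the actual ThreadNative code.)

```cpp
// Assumed fallback: if the OS cannot tell us which processor we are on
// (the PAL implementation may report -1), degrade to a per-thread value
// so callers still get some usable affinity hint.
int id = (int)::GetCurrentProcessorNumber();
if (id < 0)
    id = (int)::GetCurrentThreadId();
```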

Are processor groups contiguous? (1, 2, 3, ...)

@VSadov (Member, Author) commented:

To answer my own question: yes, the OS "packs" cores into as few processor groups as possible and considers topology when assigning them.
https://docs.microsoft.com/en-us/windows/win32/procthread/processor-groups
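
The group layout can be inspected with the Win32 group APIs; a small sketch that prints how full each group actually is (illustrative only):

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    // The OS packs logical processors into as few groups as possible,
    // but a group is not necessarily filled to its 64-processor limit.
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; g++)
        printf("group %u: %lu active processors\n",
               (unsigned)g, GetActiveProcessorCount(g));

    printf("total: %lu\n", GetActiveProcessorCount(ALL_PROCESSOR_GROUPS));
    return 0;
}
```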

@sergiy-k (Contributor) commented Dec 5, 2019

/cc: @janvorli

@janvorli (Member) commented Dec 5, 2019

> GetCurrentProcessorNumber is capped to 64 on Windows, and that results in unexpected sharing when there are 64+ cores.

It is not capped - it returns the number of the processor within the current processor group, and a processor group cannot have more than 64 CPUs.
I am not sure that returning a combined processor number the way you've done it (group * 64 + in_group_cpu_index) is something we should make public, since Windows has no concept of a CPU index larger than 64 and groups don't necessarily have to be completely filled. That means the range of CPU indices would not necessarily be contiguous.
In the GC code it is actually used as an internal encoding of group / index; we always decode it back to group / index before using it.
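
(The round trip being described, sketched here for clarity; the real GC helpers have their own names and details.)

```cpp
#include <stdint.h>

// group * 64 + in-group index, used purely as an internal encoding.
inline uint32_t EncodeProcNo(uint16_t group, uint8_t numberInGroup)
{
    return ((uint32_t)group << 6) | numberInGroup;
}

// Always decoded back to group / index before use.
inline void DecodeProcNo(uint32_t encoded, uint16_t* group, uint8_t* numberInGroup)
{
    *group         = (uint16_t)(encoded >> 6);
    *numberInGroup = (uint8_t)(encoded & 63);
}
```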

@VSadov (Member, Author) commented Dec 5, 2019

The goal here is to form an integer that the user can use to softly affinitize data to cores.
It is not meant to describe topology; if that is interesting, we may need another API for it.

Returning the processor number within a group is clearly wrong, since it maps all cores into the 0-63 range.

If there is a better way to produce a CoreID - what is it?
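
To make "softly affinitize data to cores" concrete, here is a sketch of the intended kind of use; the striped-counter scheme, the array bound, and the inline fold are assumptions for illustration, not a proposed API:

```cpp
#include <windows.h>
#include <atomic>

// Each core (identified by the flat processor id) gets its own
// cache-line-sized slot, so threads running on different cores rarely
// contend. With only the within-group number, core N of every group
// would land on the same slot.
struct alignas(64) PaddedCounter { std::atomic<long long> value{0}; };

static PaddedCounter g_counters[4 * 64];   // room for 4 groups of 64 (illustrative bound)

void Increment()
{
    PROCESSOR_NUMBER proc;
    GetCurrentProcessorNumberEx(&proc);
    size_t id = ((size_t)proc.Group << 6) | proc.Number;
    g_counters[id % (sizeof(g_counters) / sizeof(g_counters[0]))]
        .value.fetch_add(1, std::memory_order_relaxed);
}
```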

@jkotas (Member) commented Dec 5, 2019

Any difference in perf compared to what we have today?

@VSadov (Member, Author) commented Dec 5, 2019

@jkotas - do you mean the perf difference in the API itself due to the extra << and |, or the perf difference on a 256-core machine where every proc #N shares data with proc #N in the 3 other NUMA nodes?

The latter really depends on the app.
The sharing is very unintuitive, though. Even if we wanted to compress the ProcID range to 64, I'd prefer to share within nodes, not across them.

@VSadov (Member, Author) commented Dec 5, 2019

On machines with a fast GetCurrentProcessorNumber this change adds 5-10% to the cost of the FCALL.

The following are measurements as reported by the printf-instrumented #467 when it is forced to calibrate the FCALL against a standalone access to a ThreadStatic.
I picked just the best measurements, since there is some noise.

================= Main machine (12 logical cores, Coffeelake 4.2 GHz)

100ns tick (Stopwatch)
4096 iters

times (in ticks for one iteration)
=== ignoring cpu group (just calling GetCurrentProcessorNumber)
ID: 0.08984375 TLS: 0.04443359375

=== considering cpu group (calling and folding GetCurrentProcessorNumberEx)
ID: 0.095703125 TLS: 0.04443359375

adds 6.5% to ID call

================== Older machine (8 logical cores, Kabylake 4.0 GHz)
284ns tick
2048 iters

times (in ticks)
=== ignoring cpu group
ID: 0.03466796875 TLS: 0.015625

=== considering cpu group
ID: 0.0361328125 TLS: 0.0146484375

adds 4% to ID call

================== Rome (256 logical cores, Zen2, 2.4 GHz, has RDPID)
100ns tick
4096 iters

times (in ticks)
=== ignoring cpu group
ID: 0.0455322265625 TLS: 0.074462890625

=== considering cpu group
ID: 0.0498046875 TLS: 0.074462890625

adds 9% per ID call
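
The numbers above come from the instrumented build of #467 calibrating the managed FCALL against a ThreadStatic access. For a rough feel of the native-side difference alone, a harness along these lines could be used (illustrative only, not the harness that produced the numbers above):

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    const int iters = 1 << 24;
    LARGE_INTEGER freq, t0, t1, t2;
    QueryPerformanceFrequency(&freq);

    unsigned sink = 0;

    QueryPerformanceCounter(&t0);
    for (int i = 0; i < iters; i++)
        sink += GetCurrentProcessorNumber();            // ignores the group

    QueryPerformanceCounter(&t1);
    for (int i = 0; i < iters; i++)
    {
        PROCESSOR_NUMBER p;
        GetCurrentProcessorNumberEx(&p);
        sink += (p.Group << 6) | p.Number;              // folds the group in
    }
    QueryPerformanceCounter(&t2);

    printf("plain: %.2f ns/call   ex+fold: %.2f ns/call   (sink=%u)\n",
           (t1.QuadPart - t0.QuadPart) * 1e9 / freq.QuadPart / iters,
           (t2.QuadPart - t1.QuadPart) * 1e9 / freq.QuadPart / iters,
           sink);
    return 0;
}
```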

@jkotas (Member) commented Dec 5, 2019

Sounds reasonable.

@VSadov merged commit a6d62d6 into dotnet:master on Dec 6, 2019
@VSadov deleted the BigProcNum branch on Dec 6, 2019 at 02:34
The conversation was locked as resolved and limited to collaborators on Dec 11, 2020