
Conversation


@VSadov VSadov commented Dec 24, 2025

This is a follow-up to the recommendations from the Scalability Experiments done some time ago.

The Scalability Experiments resulted in many suggestions. In this part we look at the overheads of submitting and executing a workitem on the threadpool from the thread-scheduling point of view. In particular, this PR tries to minimize changes to the workqueue, to keep the change scoped.
The workqueue-related recommendations will be addressed separately.

The threadpool parts are very interconnected though, and sometimes removing one bottleneck causes another one to show up, so some workqueue changes had to be made just to avoid regressions.

There are also a few "low-hanging fruit" fixes for per-workitem overheads, such as unnecessary fences or too-frequent modifications of shared state.
Hopefully this will negate some of the regressions from #121887 (as reported in #122186).

In this change:

  • Fewer operations per workitem where possible.
    For example, fewer/weaker fences, reporting the heartbeat once per dispatch quantum instead of once per workitem, etc.

  • Avoid spurious wakes of worker threads (except, unavoidably, when the thread goal is changed - by HillClimbing and such).
    Only one thread is requested at a time, and requesting another thread is conditioned on evidence of work present in the queue (basically the minimum required for correctness).
    As a result, a thread that becomes active typically finds work.
    In particular, this avoids a cascade of spurious wakes when the pool is running out of workitems.

  • Stop tracking the spinner count in the LIFO semaphore.
    We could keep track of spinners, but the informational value of knowing such an extremely transient count is close to zero, so we should not.

  • No Sleep in the LIFO semaphore.
    Using spin-Sleep is questionable in a synchronization primitive that can block and ask the OS to wake a thread deterministically.

  • Shorten spinning in the LIFO semaphore to a more affordable value.
    Since the LIFO semaphore can perform a blocking wait until the condition it wants to see happens, once spinning gets into the range of the wait/wake latency it makes no sense to spin for much longer.
    It is also not uncommon for the work to be introduced by non-pool threads, so the pool threads may have to block just to allow more work to be scheduled.
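The wake-avoidance idea in the bullets above can be illustrated with a small single-threaded sketch. This is a toy model with hypothetical names, not the actual runtime code; a lock stands in for the interlocked check-then-exchange on the request flag:

```python
# Toy model of the "single outstanding thread request" idea: enqueuers request
# at most one worker wake at a time; a woken worker claims the request before
# dispatching. Hypothetical names; a lock emulates the interlocked operations.
import threading
from collections import deque

class MiniPool:
    def __init__(self):
        self._lock = threading.Lock()      # stands in for Interlocked ops
        self._queue = deque()
        self._has_outstanding_request = False
        self.wakes = 0                     # how many worker wakes were signaled

    def enqueue(self, item):
        self._queue.append(item)
        self._ensure_worker_requested()

    def _ensure_worker_requested(self):
        # Check-then-exchange: only the thread that flips False -> True signals.
        with self._lock:
            if self._has_outstanding_request:
                return
            self._has_outstanding_request = True
        self.wakes += 1                    # would be a semaphore signal

    def dispatch(self):
        # A woken worker claims (clears) the request exactly once, then runs
        # workitems. Returns how many items it executed.
        with self._lock:
            self._has_outstanding_request = False
        executed = 0
        while self._queue:
            self._queue.popleft()()
            executed += 1
        return executed
```

With this shape, N enqueues while a request is already outstanding produce only one wake. A fuller version would have `dispatch()` call `_ensure_worker_requested()` again when it observes leftover work, which is the "conditioned on evidence of work in the queue" part.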

@dotnet-policy-service

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.


VSadov commented Dec 25, 2025

To measure per-task overhead I use a subset of the following benchmark: https://github.com/benaadams/ThreadPoolTaskTesting

Results as measured on Win11 with an AMD 7950X (16 cores).

The following is set to reduce possible noise:
DOTNET_TieredCompilation=0
DOTNET_GCDynamicAdaptationMode=0

Measurement is in number of tasks per second. Higher is better.

=== Baseline:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.016 M     3.122 M     3.027 M     3.048 M     3.037 M     3.014 M     2.992 M     2.997 M     3.003 M
- Depth    2                     3.005 M     2.958 M     3.026 M     3.002 M     2.935 M     2.970 M     2.990 M     2.977 M     3.043 M
- Depth    4                     2.901 M     2.843 M     2.972 M     2.967 M     3.019 M     2.980 M     3.004 M     2.991 M     3.010 M
- Depth    8                     2.965 M     2.919 M     2.902 M     2.734 M     2.902 M     2.922 M     2.913 M     2.927 M     2.937 M
- Depth   16                     2.934 M     2.914 M     2.820 M     2.896 M     2.917 M     2.928 M     2.932 M     2.877 M     2.910 M
- Depth   32                     2.893 M     2.871 M     2.887 M     2.898 M     2.892 M     2.912 M     2.916 M     2.899 M     2.892 M
- Depth   64                     2.894 M     2.867 M     2.888 M     2.881 M     2.877 M     2.887 M     2.879 M     2.883 M     2.915 M
- Depth  128                     2.902 M     2.882 M     2.917 M     2.908 M     2.908 M     2.904 M     2.915 M     2.904 M     2.901 M
- Depth  512                     2.925 M     2.914 M     2.921 M     2.924 M     2.925 M     2.905 M     2.915 M     2.911 M     2.942 M


QUWI Queue Local (TP)            4.799 M     6.223 M    10.593 M    17.118 M    32.025 M    29.635 M    33.384 M    34.646 M    42.084 M
- Depth    2                     6.213 M    10.461 M    14.443 M    21.303 M    32.522 M    38.689 M    39.372 M    40.590 M    43.064 M
- Depth    4                     9.471 M    14.051 M    21.886 M    32.072 M    39.414 M    44.118 M    44.708 M    44.957 M    45.676 M
- Depth    8                    14.232 M    21.544 M    33.413 M    38.438 M    42.951 M    46.537 M    46.760 M    46.946 M    46.824 M
- Depth   16                    21.784 M    33.438 M    37.524 M    41.762 M    45.507 M    47.363 M    47.675 M    47.967 M    48.034 M
- Depth   32                    33.545 M    40.413 M    43.498 M    46.019 M    48.190 M    48.020 M    48.061 M    48.373 M    47.901 M
- Depth   64                    40.034 M    43.087 M    45.389 M    47.332 M    48.465 M    49.146 M    49.223 M    48.879 M    49.355 M
- Depth  128                    42.980 M    46.443 M    47.577 M    48.383 M    48.804 M    48.962 M    48.755 M    49.365 M    49.517 M
- Depth  512                    47.251 M    49.185 M    49.091 M    49.134 M    49.473 M    49.010 M    49.398 M    49.381 M    49.238 M

With this PR:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.063 M     3.083 M     3.047 M     3.063 M     3.003 M     3.033 M     3.037 M     3.011 M     3.026 M
- Depth    2                     3.017 M     2.908 M     2.952 M     3.017 M     3.027 M     2.948 M     2.968 M     3.000 M     3.044 M
- Depth    4                     2.925 M     2.977 M     3.024 M     3.008 M     2.979 M     2.980 M     3.004 M     3.022 M     2.991 M
- Depth    8                     2.990 M     2.927 M     2.821 M     2.721 M     2.915 M     2.951 M     2.942 M     2.915 M     2.975 M
- Depth   16                     2.942 M     2.967 M     2.911 M     2.912 M     3.021 M     2.941 M     2.901 M     2.963 M     2.928 M
- Depth   32                     2.961 M     2.944 M     2.948 M     2.944 M     2.986 M     2.954 M     2.937 M     2.939 M     2.936 M
- Depth   64                     2.968 M     2.952 M     2.948 M     2.960 M     2.959 M     2.944 M     2.949 M     2.951 M     2.956 M
- Depth  128                     2.964 M     2.966 M     2.963 M     2.966 M     2.958 M     2.973 M     2.975 M     2.961 M     2.963 M
- Depth  512                     2.990 M     2.968 M     2.971 M     2.988 M     2.963 M     2.979 M     2.977 M     2.984 M     2.984 M


QUWI Queue Local (TP)            5.492 M    10.456 M    17.804 M    18.884 M    48.797 M   127.532 M   162.716 M   158.277 M   214.897 M
- Depth    2                    11.165 M    19.263 M    17.196 M    29.374 M    76.102 M   161.291 M   160.480 M   178.629 M   209.432 M
- Depth    4                    19.652 M    19.699 M    25.000 M    53.820 M   101.990 M   171.157 M   176.565 M   192.625 M   214.042 M
- Depth    8                    23.519 M    25.474 M    37.569 M    91.185 M   136.847 M   157.626 M   183.437 M   199.642 M   204.375 M
- Depth   16                    27.862 M    41.280 M    76.235 M   118.098 M   159.696 M   200.514 M   197.469 M   209.445 M   211.314 M
- Depth   32                    40.314 M    77.313 M   115.200 M   150.681 M   187.060 M   204.254 M   201.597 M   205.071 M   211.376 M
- Depth   64                    73.297 M   139.082 M   172.258 M   176.718 M   199.829 M   218.152 M   205.455 M   205.482 M   213.325 M
- Depth  128                   133.615 M   176.833 M   186.944 M   199.672 M   205.262 M   207.053 M   201.800 M   211.560 M   215.708 M
- Depth  512                   192.360 M   210.508 M   210.339 M   217.688 M   211.859 M   208.360 M   212.702 M   212.503 M   212.348 M

In QUWI Queue Local we are able to execute many more workitems per second.

QUWI No Queues is bottlenecked on the FIFO workitem queue. This PR does not address that part, thus there are no benefits from concurrency, depth, or anything else.

NOTE: this is a microbenchmark! These tasks are trivial, on the level of "increment a counter".
The benchmark is a good tool for checking per-task overheads and bottlenecks.
The improvements will vary in actual scenarios where workitems do more work than in the benchmark.


VSadov commented Dec 25, 2025

For reference - the same as above, but with
set DOTNET_ThreadPool_UseWindowsThreadPool=1

In this case the task queue is the same, but the thread management is done by the OS.
In particular there is no HillClimbing and other similar things, thus per-task expenses are a bit lower to start with.
(There are some downsides to using the OS threadpool, but per-task expenses are smaller, at least in the current implementation.)

This variant benefits more from the per-workitem improvements in the PR.

=== Baseline:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.075 M     3.068 M     3.061 M     3.077 M     3.030 M     3.082 M     3.058 M     3.056 M     3.045 M
- Depth    2                     3.016 M     3.010 M     3.037 M     2.961 M     3.011 M     3.033 M     2.937 M     2.958 M     3.001 M
- Depth    4                     2.945 M     2.997 M     2.983 M     2.987 M     2.990 M     2.975 M     2.990 M     2.965 M     2.942 M
- Depth    8                     2.963 M     2.909 M     2.953 M     2.977 M     2.912 M     2.991 M     2.947 M     3.001 M     2.977 M
- Depth   16                     2.983 M     2.932 M     2.875 M     2.962 M     2.957 M     2.975 M     2.974 M     2.965 M     2.949 M
- Depth   32                     2.963 M     2.961 M     2.962 M     2.945 M     2.939 M     2.963 M     2.952 M     2.955 M     2.958 M
- Depth   64                     2.951 M     2.955 M     2.957 M     2.962 M     2.947 M     2.959 M     2.944 M     2.957 M     2.960 M
- Depth  128                     2.956 M     2.972 M     2.972 M     2.972 M     2.969 M     2.961 M     2.967 M     2.966 M     2.967 M
- Depth  512                     2.964 M     2.968 M     2.966 M     2.965 M     2.964 M     2.963 M     2.974 M     2.968 M     2.971 M


QUWI Queue Local (TP)            7.631 M    15.943 M    22.686 M    30.837 M    35.689 M    45.171 M    50.381 M    52.533 M    55.251 M
- Depth    2                    12.866 M    21.285 M    32.843 M    27.394 M    39.129 M    51.520 M    52.513 M    53.210 M    54.110 M
- Depth    4                    22.034 M    31.817 M    29.279 M    37.944 M    44.660 M    53.304 M    54.043 M    54.486 M    55.104 M
- Depth    8                    36.834 M    36.869 M    40.448 M    44.281 M    50.549 M    55.034 M    55.973 M    55.923 M    55.973 M
- Depth   16                    45.172 M    40.375 M    44.735 M    48.158 M    52.005 M    56.377 M    56.223 M    56.254 M    56.111 M
- Depth   32                    38.088 M    45.396 M    47.299 M    50.493 M    54.404 M    56.791 M    56.596 M    56.820 M    56.028 M
- Depth   64                    44.804 M    48.991 M    51.455 M    53.755 M    55.848 M    56.931 M    57.380 M    56.553 M    56.975 M
- Depth  128                    48.775 M    51.866 M    53.985 M    55.647 M    56.168 M    57.248 M    57.098 M    57.415 M    56.359 M
- Depth  512                    55.442 M    55.050 M    56.782 M    56.497 M    57.615 M    57.055 M    57.175 M    56.792 M    56.774 M

With this PR:

Testing 2,621,440 calls, with GCs after 2,621,440 calls.
Operations per second on 32 Cores
                                                                                                                             Parallelism
                                  Serial          2x          4x          8x         16x         32x         64x        128x        512x
QUWI No Queues (TP)              3.063 M     3.110 M     3.060 M     3.041 M     3.050 M     3.056 M     3.057 M     3.069 M     3.038 M
- Depth    2                     3.047 M     3.026 M     3.013 M     3.029 M     2.968 M     2.969 M     3.021 M     3.003 M     3.006 M
- Depth    4                     2.947 M     3.036 M     3.076 M     3.012 M     3.041 M     3.028 M     3.011 M     3.037 M     3.053 M
- Depth    8                     2.974 M     3.015 M     3.012 M     3.025 M     3.020 M     3.026 M     3.009 M     3.004 M     3.059 M
- Depth   16                     3.014 M     3.031 M     2.959 M     3.043 M     3.017 M     3.012 M     3.015 M     3.021 M     3.034 M
- Depth   32                     3.015 M     2.995 M     3.037 M     3.014 M     3.022 M     2.991 M     3.030 M     2.994 M     3.041 M
- Depth   64                     2.998 M     3.024 M     3.023 M     3.004 M     3.025 M     3.019 M     3.015 M     2.992 M     3.000 M
- Depth  128                     3.008 M     3.003 M     3.005 M     3.006 M     3.010 M     3.006 M     3.000 M     3.004 M     3.004 M
- Depth  512                     3.013 M     3.010 M     3.015 M     3.013 M     3.015 M     3.015 M     3.010 M     3.018 M     3.030 M


QUWI Queue Local (TP)            7.077 M    13.085 M    27.234 M    33.498 M    54.361 M   163.725 M   223.814 M   230.576 M   238.356 M
- Depth    2                    13.710 M    24.174 M    37.391 M    43.846 M    84.419 M   239.729 M   234.524 M   247.397 M   248.473 M
- Depth    4                    22.145 M    34.966 M    52.797 M    62.646 M   126.954 M   193.050 M   222.039 M   255.051 M   247.278 M
- Depth    8                    35.400 M    56.054 M    65.657 M    93.659 M   166.261 M   239.907 M   252.275 M   235.815 M   259.223 M
- Depth   16                    55.728 M    70.509 M   101.264 M   141.210 M   196.271 M   258.734 M   261.945 M   257.276 M   259.935 M
- Depth   32                    70.777 M    81.802 M   133.855 M   175.980 M   233.571 M   248.360 M   256.966 M   261.785 M   263.864 M
- Depth   64                    89.869 M   144.095 M   203.279 M   207.784 M   243.898 M   261.001 M   256.780 M   264.602 M   261.545 M
- Depth  128                   146.725 M   198.235 M   212.273 M   229.452 M   254.920 M   259.824 M   261.772 M   261.219 M   262.685 M
- Depth  512                   239.916 M   253.174 M   258.385 M   258.077 M   262.625 M   265.057 M   264.875 M   264.383 M   262.972 M

@VSadov VSadov marked this pull request as ready for review December 25, 2025 01:26
Copilot AI review requested due to automatic review settings December 25, 2025 01:27

Copilot AI left a comment


Pull request overview

This PR implements performance improvements to the thread pool by reducing per-work-item overhead and minimizing spurious thread wakeups. The primary focus is on optimizing thread scheduling and synchronization while maintaining correctness.

Key changes include:

  • Introducing a single outstanding thread request flag (_hasOutstandingThreadRequest) to replace the counter-based approach, preventing thundering herd issues
  • Reducing memory barriers and fence operations in critical paths where volatile semantics provide sufficient guarantees
  • Simplifying semaphore spinning logic and removing spinner tracking overhead
  • Implementing exponential backoff for contended interlocked operations
  • Deferring work queue assignment to reduce lock contention during dispatch startup
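One of the listed changes, exponential backoff for contended interlocked operations, can be sketched as follows. The names here are illustrative, not the actual Backoff.cs API; per the file summary below, the real code derives its randomness from a stack address rather than a PRNG:

```python
# Toy sketch of exponential backoff for a contended compare-and-swap loop.
# After each failed CAS the caller spins for an exponentially growing,
# slightly randomized number of iterations, which spreads retries apart and
# reduces cache-line contention on the shared word.
import random

def backoff_iterations(failed_attempts, max_shift=10):
    """How many spin iterations to wait after `failed_attempts` CAS failures."""
    shift = min(failed_attempts, max_shift)   # cap the exponent: bounded wait
    ceiling = 1 << shift
    # Randomize within the top half of the window to de-synchronize retriers.
    return random.randint((ceiling >> 1) + 1, ceiling) if ceiling > 1 else 1
```

With each failure the expected wait roughly doubles up to the cap, so two threads that collided once are unlikely to collide again on the very next retry.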

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 6 comments.

File Description
WindowsThreadPool.cs Adds CacheLineSeparated struct with thread request flag and implements EnsureWorkerRequested() with check-exchange pattern to prevent duplicate requests
ThreadPoolWorkQueue.cs Removes unnecessary volatile writes in LocalPush, eliminates _mayHaveHighPriorityWorkItems flag, defers queue assignment, and updates thread request calls
ThreadPool.Windows.cs Changes YieldFromDispatchLoop to accept currentTickCount parameter and renames RequestWorkerThread to EnsureWorkerRequested
ThreadPool.Unix.cs Updates YieldFromDispatchLoop signature to call NotifyDispatchProgress and renames RequestWorkerThread
ThreadPool.Wasi.cs Updates YieldFromDispatchLoop signature with pragma to suppress unused parameter warning
ThreadPool.Browser.cs Updates YieldFromDispatchLoop signature with pragma to suppress unused parameter warning
ThreadPool.Browser.Threads.cs Updates YieldFromDispatchLoop signature with pragma to suppress unused parameter warning
PortableThreadPool.cs Renames lastDequeueTime to lastDispatchTime, replaces numRequestedWorkers with _hasOutstandingThreadRequest, refactors NotifyWorkItemProgress methods
PortableThreadPool.WorkerThread.cs Reduces semaphore spin count from 70 to 9, refactors WorkerDoWork to use check-exchange pattern, implements TryRemoveWorkingWorker with overflow handling
PortableThreadPool.ThreadCounts.cs Adds IsOverflow property and TryIncrement/DecrementProcessingWork methods to manage overflow state using high bit of _data
PortableThreadPool.GateThread.cs Updates starvation detection to use _hasOutstandingThreadRequest and lastDispatchTime
PortableThreadPool.Blocking.cs Updates blocking adjustment logic to check _hasOutstandingThreadRequest
LowLevelLifoSemaphore.cs Removes spinner tracking, changes Release(count) to Signal(), simplifies Wait() to remove spinWait parameter, restructures Counts bit layout
Backoff.cs Introduces new exponential backoff utility using stack address-based randomization for collision retry scenarios
System.Private.CoreLib.Shared.projitems Adds Backoff.cs to project compilation


VSadov commented Dec 29, 2025

Rebased onto most recent main.


VSadov commented Dec 29, 2025

Impact on TE benchmarks on Linux.
I am using the JSON scenario (as in json.benchmarks.yml --scenario json --profile aspnet-gold-lin).
That is an Intel machine: 56 logical cores, 1 socket, 1 NUMA node, 64 GB.

TE benchmarks are often more sensitive to the promptness of workitem execution (which this change addresses only indirectly) than to the overhead of a single task. Thus I am mostly looking to confirm there are no show-stopping regressions.

I see a small (~1%) improvement in throughput. Roughly the same latency.

=== Baseline:

| First Request (ms)        | 312                 |
| Requests/sec              | 1,795,340           |
| Requests                  | 27,109,256          |
| Mean latency (ms)         | 0.14                |
| Max latency (ms)          | 9.68                |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 285.93              |
| Latency 50th (ms)         | 0.12                |
| Latency 75th (ms)         | 0.17                |
| Latency 90th (ms)         | 0.23                |
| Latency 99th (ms)         | 0.35                |

=== New:

| First Request (ms)        | 295                 |
| Requests/sec              | 1,827,391           |
| Requests                  | 27,593,647          |
| Mean latency (ms)         | 0.14                |
| Max latency (ms)          | 9.33                |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 291.04              |
| Latency 50th (ms)         | 0.12                |
| Latency 75th (ms)         | 0.16                |
| Latency 90th (ms)         | 0.22                |
| Latency 99th (ms)         | 0.36                |


VSadov commented Dec 29, 2025

Same benchmark on Windows, on a similar machine: Intel, 56 logical cores, 1 socket, 1 NUMA node, 64 GB.
(json.benchmarks.yml --scenario json --profile aspnet-gold-win)

I see roughly the same 1% throughput improvement.

The latency has a lot of noise and differs from run to run, both with and without the change. It looks to be roughly in the same range though.

Baseline:

| First Request (ms)        | 233                 |
| Requests/sec              | 1,378,796           |
| Requests                  | 20,819,577          |
| Mean latency (ms)         | 3.57                |
| Max latency (ms)          | 244.53              |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 219.59              |
| Latency 50th (ms)         | 0.11                |
| Latency 75th (ms)         | 0.48                |
| Latency 90th (ms)         | 4.94                |
| Latency 99th (ms)         | 78.33               |

New:

| First Request (ms)        | 245                 |
| Requests/sec              | 1,394,492           |
| Requests                  | 21,056,936          |
| Mean latency (ms)         | 1.31                |
| Max latency (ms)          | 256.58              |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 222.09              |
| Latency 50th (ms)         | 0.10                |
| Latency 75th (ms)         | 0.55                |
| Latency 90th (ms)         | 2.89                |
| Latency 99th (ms)         | 19.51               |


VSadov commented Jan 6, 2026

One interesting observation is that after this change we generally can have only one signal at a time in the semaphore, because:

  1. a worker may be activated only in response to a new request being set.
  2. an activated worker will always see a request set and will clear it before doing Dispatch
     (and will likely see workitems in the queue, although that is not guaranteed, as existing workers keep dispatching while there is work).
  3. an activated worker can clear a request only once per activation.
  4. the "overflow" state allows for one "virtual" activation that does not involve the semaphore - the thread that clears the overflow still acts on the most recent request, it just skips the signal/wait and goes directly to the "claim the request and do Dispatch" part.

Either way, we cannot see two concurrent signals in the semaphore, virtual or real, as signaling requires setting a thread request as a prerequisite, and clearing a request requires consuming a signal.

One thing that breaks the above invariants is that we allow the active thread goal to change dynamically (SetMinThreads API, HillClimbing, starvation detection, ...). That can allow an extra worker activation while the same request is pending.
None of that is a common scenario though.
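The pairing described above can be captured as a toy state machine (a simplification for illustration, not the runtime's actual code): a signal is produced only by the transition request False -> True, and consumed only by the transition request True -> False, so at most one signal can ever be pending.

```python
# Toy model of the signal/request pairing: producers signal only when they
# are the one to set the request flag; a woken worker consumes one signal and
# clears the flag exactly once per activation. The dynamic thread-goal
# changes that break the invariant in the real pool are not modeled here.
class RequestSignalModel:
    def __init__(self):
        self.request = False
        self.pending_signals = 0   # signals currently sitting in the semaphore

    def enqueue(self):
        # Producer side: only the thread that flips the request flag signals.
        if not self.request:
            self.request = True
            self.pending_signals += 1
        assert self.pending_signals <= 1   # the invariant argued above

    def worker_claim(self):
        # Consumer side: a woken worker consumes one signal and clears the
        # request before dispatching; it can do this only once per activation.
        if self.pending_signals:
            self.pending_signals -= 1
            self.request = False
            return True
        return False
```

Running any interleaving of `enqueue()` and `worker_claim()` keeps `pending_signals` at 0 or 1, matching the "one signal at a time" observation.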
