Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Some more perf tweaks to ThreadPool#9279

Merged
stephentoub merged 9 commits intodotnet:masterfrom
stephentoub:more_tp_perf
Feb 3, 2017
Merged

Some more perf tweaks to ThreadPool#9279
stephentoub merged 9 commits intodotnet:masterfrom
stephentoub:more_tp_perf

Conversation

@stephentoub
Copy link
Copy Markdown
Member

Main changes:

  • The work-stealing system was storing all of the local lists in a sparse array. Any time a thread went looking for work, it would pick a random place to start in the array and iterate through all of the cells. But the array often was much larger than the number of lists it contained, because it a) grows doubling each time, and b) never shrinks when threads are removed, so all of the iterating needed to check for null on every cell. And all of the reads were done with Volatile.Read, to account for potentially concurrent writes as threads were added/removed. However, the addition/removal of thread-local lists is actually relatively rare (even when the thread pool takes threads in and out of service, it's not generally not allocating new threads or destroying old ones, just keeping them in reserve on the side for a while). Instead, we can just use an immutable array that's always the right size.
  • All of the old StackCrawlMark was still cluttering up {Unsafe}QueueUserWorkItem and preventing those functions and helpers from getting inlined. I've removed that goo.
  • Random.Next is used to determine where a thread should start looking for work to steal, and it was showing up in a trace as a few percentage points. I've replaced it with a faster routine.
  • Cleaned up some empty try blocks and the associated code around it, and helped a few other functions that were showing up in traces be more inlineable.

cc: @jkotas, @kouvel, @benaadams

@@ -534,81 +536,66 @@ internal static bool Dispatch()
// Dequeue and EnsureThreadRequested must be protected from ThreadAbortException.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment not applicable anymore

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

}

// Simple random number generator. We don't need great randomness, we just need a little and for it to be fast.
internal sealed class FastRandom // xorshift prng
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to a struct

@stephentoub
Copy link
Copy Markdown
Member Author

@jkotas, I've not seen this error before (in the arm Cross Debug Build leg):

error C3859: virtual memory range for PCH exceeded; please recompile with a command line option of '-Zm58' or greater [D:\j\workspace\arm_cross_deb---74b8368e\bin\obj\Windows_NT.arm.Debug\crossgen\src\vm\crossgen\cee_crossgen.vcxproj]

I assume this has nothing to do with my PR?

@jkotas
Copy link
Copy Markdown
Member

jkotas commented Feb 2, 2017

I do not think it has anything to do with your PR. I have not seen it before.

@stephentoub
Copy link
Copy Markdown
Member Author

@dotnet-bot test Windows_NT arm Cross Debug Build please

_w = _w ^ (_w >> 19) ^ (t ^ (t >> 8));

int r = (int)_w & int.MaxValue;
return (int)(r * (1.0 / ((double)int.MaxValue + 1)) * maxValue);
Copy link
Copy Markdown
Member

@benaadams benaadams Feb 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as idiv bothers me; I wonder what the effect of casting to double doing math and casting back; since vectors are increasingly in play; may be better just to % maxValue? (with early exit for maxValue == 0)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll switch to mod. Looks like it might be 1-2% faster, but regardless it's simpler.

Copy link
Copy Markdown
Member

@benaadams benaadams Feb 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still wonder if light coordination between the threads using interlocked might be better than random to avoid collisions

As in rather than
int i = tl.random.Next(c);
do
int i = NextCounter.Next(c);

with it being something like https://gist.github.com/benaadams/9a66811bc54e7126a9cd8c181bbe6f53

Maybe something to look at in future.

Copy link
Copy Markdown
Member Author

@stephentoub stephentoub Feb 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, Ben. Please let me know what you find (if you look).

@benaadams
Copy link
Copy Markdown
Member

LGTM

The current data structure used to store the list of local work-stealing queues is maintains a sparse array, where entries can be null, and where removals null out entries in an active array that could be being read by another thread concurrently.  This necessitates that threads looking for work need to use volatile reads and null checks on each element.  Further, because the array doubles in size when it grows, and never shrinks, we often end up having many empty slots that threads need to look at as they're looking for work.

It's actually relatively rare for threads to come and go.  While the thread pool does take threads in and out of service, it only rarely actually terminates a thread or asks the OS for a new one, so it's relatively rare that threads are added/removed from the list.  Given that, we can simply use immutable arrays, creating a new array of the exact right size whenever a thread is added or removed.  Then iteration can be done without volatile reads and without null checks, because the contents of the array being read through won't change and won't ever be null.
And clean up the code around it.
We need some randomness, but it doesn't need to be particularly good, just fast.  A simple xorshift prng is 3-4x faster than Random, in particular without multiple virtual method calls that don't get inlined.  It's also a bit lighter in terms of memory (though there's only one of these per thread, so it doesn't impact much).
- Change LocalPop to be parameterless and return the IThreadPoolWorkItem rather than returning a bool and having the IThreadPoolWorkItem in an out arg.
- Refactor LocalPop slightly to have an inlineable upfront check for whether the local queue is empty, and only call the non-inlineable method if there's something that could potentially be popped
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Some more perf tweaks to ThreadPool

Commit migrated from dotnet/coreclr@cfd28c5
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants