
[WIP] Deactivate ThreadPool local queues when idle #21713

Closed

benaadams wants to merge 1 commit into dotnet:master from benaadams:localqueues

Conversation

@benaadams (Member) commented Dec 30, 2018

  1. Defer adding the thread's local queue to the list of active queues (the queues others can steal from) until the first local item is queued, rather than at thread creation (see the sketch below).

  2. Remove the thread's local queue from the list of active queues when the thread finds no work to do - i.e. nothing in the global queue, its own local queue, or any other local queue (and no missed steal) - in other words, the thread is idling.

  3. Re-add the queue when the next local item is queued (back to step 1).

Resolves: #19088
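
A minimal sketch of the proposed scheme (illustrative names and simplified types only; the actual coreclr internals use a lock-free work-stealing deque, not ConcurrentQueue):

```csharp
using System;
using System.Collections.Concurrent;

// Illustrative only: stand-in names and simplified types, not the coreclr code.
internal static class WorkStealingQueueList
{
    private static readonly ConcurrentDictionary<ConcurrentQueue<Action>, bool> s_queues = new();

    public static void Add(ConcurrentQueue<Action> q) => s_queues.TryAdd(q, true);
    public static void Remove(ConcurrentQueue<Action> q) => s_queues.TryRemove(q, out _);
}

internal sealed class ThreadLocalQueue
{
    public readonly ConcurrentQueue<Action> LocalQueue = new();
    private bool _isActive; // touched only by the owning thread

    // Steps 1 and 3: register the local queue with the steal list on the
    // first local enqueue, not at thread creation.
    public void Enqueue(Action item)
    {
        LocalQueue.Enqueue(item);
        if (!_isActive)
        {
            WorkStealingQueueList.Add(LocalQueue);
            _isActive = true;
        }
    }

    // Step 2: deregister when the thread finds no work anywhere, so other
    // workers no longer pay to scan this idle thread's (empty) queue.
    public void OnNoWorkFound()
    {
        if (_isActive && LocalQueue.IsEmpty)
        {
            WorkStealingQueueList.Remove(LocalQueue);
            _isActive = false;
        }
    }
}
```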

@stephentoub (Member) commented Dec 30, 2018

Doesn't this potentially significantly increase both allocation and contention if a thread repeatedly adds its local queue and then removes it, which would happen if it processed a work item that queued a single local work item? It'd register its list, then process its own item, and then remove the list. Seems like this could result in significant regressions. What am I missing? How are you validating this change, both for the desired improvements and against such regressions?
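
The pattern described here, as a hypothetical workload: each work item starts exactly one follow-up task, and a Task started from a thread-pool thread is queued to that thread's local queue, so under the proposed scheme the queue would bounce between registered and deregistered on every iteration.

```csharp
using System.Threading.Tasks;

static class ChurnDemo
{
    // Hypothetical worst case for the proposed scheme: each item queues exactly
    // one local follow-up (a Task started from a pool thread goes to that
    // thread's local queue), so the queue is re-registered on every enqueue and
    // deregistered as soon as its single item is drained -- add/remove churn
    // on every work item processed.
    public static void Step(int remaining)
    {
        if (remaining > 0)
            Task.Run(() => Step(remaining - 1));
    }
}
```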

@benaadams benaadams changed the title Deactivate ThreadPool local queues when idle [WIP] Deactivate ThreadPool local queues when idle Dec 30, 2018
@VSadov (Member) commented Dec 30, 2018

I have been looking at #19088 and similar issues for a long time.
I am pretty convinced that the root cause is the "queue per thread" approach itself.
There are several ways it can stand in the way of scalability or lead to anomalous, hard-to-recover-from behaviors, and it gets worse on machines with more cores.

It is not just having to scan through empty queues; queues that have only a few items are also a problem. Ultimately you have to scan the queues for correctness, and no matter how you spread the cost, it gets expensive when there are lots of queues - short queues in particular, because of contention and false sharing.
The whole tendency to get into a "numerous shallow queues" situation is harmful to the efficiency of the pool.

I think I am getting ready to discuss #18403 as a more permanent solution.
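
For context on the scanning cost mentioned above, a simplified sketch with stand-in types (not the actual coreclr scan loop): before a worker may conclude there is no work, it has to visit every registered local queue, so the idle path scales with the number of queues even when all of them are empty.

```csharp
using System;
using System.Collections.Concurrent;

static class StealScan
{
    // Simplified sketch (stand-in types): a worker whose global and local
    // queues are empty must still visit every registered queue before it can
    // go idle, so the "no work" path is O(#queues) even when all are empty.
    public static Action? TrySteal(ConcurrentQueue<Action>[] queues, int start)
    {
        for (int i = 0; i < queues.Length; i++)
        {
            // Begin at a per-thread offset so thieves don't scan in lockstep.
            if (queues[(start + i) % queues.Length].TryDequeue(out Action? item))
                return item;
        }
        return null; // scanned them all; only now is it safe to go idle
    }
}
```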

@benaadams (Member, Author)

Will close this and see where that goes :)

@benaadams benaadams closed this Dec 30, 2018
@stephentoub (Member)

> Queues that have only a few items are also a problem

Can you elaborate? You stop scanning the moment you successfully remove an item from a queue. Are you saying you're seeing contention on queues with, say, only one item causing a problem, as multiple threads all try to take from it, fail, and then continue scanning? I'd have guessed that condition would be relatively rare, in particular with threads all starting their scan at different locations.

@VSadov (Member) commented Dec 31, 2018

Just an observation from work-stealing theory:

  • Ideally, all items are popped by the thread that queued them; stealing is a more expensive fallback codepath and should be rare.
  • When stealing does happen, it is better if the queue is long, since that reduces the chance that the thief interacts with the pusher/popper (which is where contention and false sharing happen).
  • Longer queues also make it less likely that poppers will exhaust their own items and have to start stealing.

Empty queues are worse than short ones, obviously :-), since the work spent on them cannot yield a work item.
But short/unbalanced queues are also undesirable, since they move you away from the most efficient mode.
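
To make the short-queue point concrete: in a work-stealing deque the owner pops from one end while thieves take from the other, so the two parties only collide when the queue is nearly empty. A lock-based, illustrative sketch (the real structure is lock-free):

```csharp
using System.Collections.Generic;

// Illustrative lock-based deque (the real one is lock-free): the owner pushes
// and pops at the tail, thieves take from the head. With a long queue the two
// ends are far apart and rarely interact; with 0-1 items, every pop races
// with every steal over the same element and the same cache lines.
sealed class WorkDeque<T>
{
    private readonly LinkedList<T> _items = new();
    private readonly object _lock = new();

    public void Push(T item) { lock (_lock) _items.AddLast(item); } // owner end

    public bool TryPop(out T? item) // owner: LIFO from the tail
    {
        lock (_lock)
        {
            if (_items.Last is { } node) { item = node.Value; _items.RemoveLast(); return true; }
            item = default; return false;
        }
    }

    public bool TrySteal(out T? item) // thief: FIFO from the head
    {
        lock (_lock)
        {
            if (_items.First is { } node) { item = node.Value; _items.RemoveFirst(); return true; }
            item = default; return false;
        }
    }
}
```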

@VSadov (Member) commented Dec 31, 2018

To be sure, we are not talking about common cases. Under typical loads our thread pool works wonderfully.

In fact, according to the literature, most thread scheduling strategies are adequate in the common case - simply because the actual "work" dominates the scheduling cost by a wide margin.
There is not a lot of motivation to improve that case.

It is mostly about tolerance to the less common, inconvenient cases where the threadpool may become a bottleneck.

@VSadov (Member) commented Dec 31, 2018

Our "inconvenient" case is where we need more TP workers than CPU cores to compensate for worker latencies and bursty loads. Sometimes we need a lot more workers than cores.
In such cases we end up with unnecessarily many work queues, which bound to be shallow (the same total number of tasks divided over more queues) and often misbalanced.

The changes in #18403 are trying to address the root cause by keeping the #queues == #cores and by spending some extra effort to keep work queues balanced. - so that we end up operating closer to the ideal mode even when the actual work load is far from ideal.

It looks like the extra complexity generally pays for itself and the benchmark performance is roughly the same or better.
At the same time the scenarios like in #19088 are no longer a problem.

I think the changes are very promising, but could use more testing/tuning of course.
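
A rough sketch of the per-core shape being described (hypothetical, not the actual #18403 code): the queue count is pinned to the core count and a worker maps to a queue by the CPU it is currently running on, so extra workers beyond the core count share existing queues instead of adding new, shallow ones.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical shape only, not the #18403 implementation: #queues == #cores,
// and a worker picks a queue by its current processor rather than owning one,
// so more workers than cores share queues instead of creating shallow new ones.
static class PerCoreQueues
{
    private static readonly ConcurrentQueue<Action>[] s_queues = Create(Environment.ProcessorCount);

    private static ConcurrentQueue<Action>[] Create(int n)
    {
        var qs = new ConcurrentQueue<Action>[n];
        for (int i = 0; i < n; i++) qs[i] = new ConcurrentQueue<Action>();
        return qs;
    }

    // Thread.GetCurrentProcessorId is a hint; the modulo keeps the index valid.
    public static ConcurrentQueue<Action> Current =>
        s_queues[Thread.GetCurrentProcessorId() % s_queues.Length];
}
```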
