[WIP] Deactivate ThreadPool local queues when idle #21713
Conversation
Doesn't this potentially increase both allocation and contention significantly if a thread repeatedly adds its local queue and then removes it, which would happen whenever it processed a work item that queued a single local work item? It'd register its list, process its own item, and then remove the list. This seems like it could result in significant regressions. What am I missing? How are you validating this change, both for the desired improvements and against such regressions?
I have been looking at #19088 and similar issues for a long time. It is not just having to scan through empty queues; queues that have only a few items are also a problem. Ultimately you have to scan the queues for correctness, and no matter how you spread the cost, when there are lots of queues it gets expensive. Short queues hurt in particular because of contention and false sharing. I think I am getting ready to propose #18403 as a more permanent solution.
Will close this and see where that goes :) |
Can you elaborate? You stop scanning the moment you successfully remove an item from a queue. Are you saying you're seeing contention on queues with, say, only one item causing a problem, as multiple threads all try to take from it, fail, and then continue scanning? I'd have guessed that condition would be relatively rare, in particular with threads all starting their scan at different locations. |
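To make the scan being debated concrete, here is a toy model of a work-stealing pass: a thief starts at a random offset into the list of registered local queues and stops as soon as it takes an item. It is a sketch only, not CoreCLR's implementation; all names (`WorkStealingPool`, `try_steal`, etc.) are hypothetical.

```python
import random
from collections import deque
from threading import Lock

class WorkStealingPool:
    """Toy model of a steal scan: visit the registered local queues
    starting at a random offset and stop at the first item taken.
    Names are illustrative, not CoreCLR's."""

    def __init__(self):
        self.queues = []  # all currently registered local queues
        self.lock = Lock()

    def register(self, q):
        with self.lock:
            self.queues.append(q)

    def try_steal(self):
        with self.lock:
            snapshot = list(self.queues)
        if not snapshot:
            return None
        start = random.randrange(len(snapshot))
        # Every empty queue visited here is pure overhead: the scan
        # cannot end until it finds an item or has seen every queue.
        for i in range(len(snapshot)):
            q = snapshot[(start + i) % len(snapshot)]
            try:
                return q.popleft()  # take from the queue
            except IndexError:
                continue            # empty queue: a wasted visit
        return None
```

The point of the exchange above is the cost inside that loop: with many registered-but-empty (or one-item) queues, most visits are wasted, and on a real machine each visit also touches a cache line that other threads may be contending on.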
Just an observation from work-stealing theory:
Empty queues are worse than short ones, obviously :-), since the work spent scanning them cannot yield a work item.
To be sure, we are not talking about common cases. Under typical loads our thread pool works wonderfully. In fact, according to the literature, most thread scheduling strategies are adequate in the common case, simply because the actual "work" dominates the scheduling by a wide margin. This is mostly about tolerance of the less common, inconvenient cases where the thread pool may become a bottleneck.
Our "inconvenient" case is where we need more TP workers than CPU cores to compensate for worker latencies and bursty loads. Sometimes we need many more workers than cores. The changes in #18403 are trying to address the root cause by keeping the …

It looks like the extra complexity generally pays for itself, and benchmark performance is roughly the same or better. I think the changes are very promising, but they could use more testing/tuning, of course.
1. Defer adding a thread's local queue to the list of active queues (the queues other threads may steal from) until the first local item is queued, rather than doing it at thread creation.
2. Remove the thread's local queue from the list of active queues if the thread finds no work to do, i.e. nothing in the global queue, its own local queue, or any other local queue (no missed steal); in other words, the thread is idling.
3. Re-add the queue when the next local item is queued (back to step 1).
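The steps above can be sketched as a small state machine on the local queue: it joins the pool's active list only once it actually holds work, and leaves the list when its owner goes idle. This is a minimal illustration with hypothetical names (`LocalQueue`, `on_idle`, `activate`/`deactivate`), not the actual ThreadPool code.

```python
from collections import deque
from threading import Lock

class Pool:
    """Holds the list of active local queues that thieves scan."""
    def __init__(self):
        self.active_queues = []
        self.lock = Lock()

    def activate(self, q):
        with self.lock:
            self.active_queues.append(q)

    def deactivate(self, q):
        with self.lock:
            self.active_queues.remove(q)

class LocalQueue:
    """A thread's local queue under the lazy-activation scheme."""
    def __init__(self, pool):
        self.pool = pool
        self.items = deque()
        self.active = False

    def enqueue(self, item):
        self.items.append(item)
        if not self.active:
            # Steps 1 and 3: (re)activate on the first local item,
            # instead of registering at thread creation.
            self.pool.activate(self)
            self.active = True

    def on_idle(self):
        # Step 2: the owner found no work anywhere, so stop offering
        # this (empty) queue to thieves.
        if self.active and not self.items:
            self.pool.deactivate(self)
            self.active = False
```

This also makes the reviewer's concern visible: a workload that alternates one local enqueue with going idle would bounce the queue in and out of `active_queues`, paying the `activate`/`deactivate` synchronization cost on every cycle.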
Resolves: #19088