API to specify number of threads, from threadpool, to use for the task #32
kimishpatel wants to merge 1 commit into google:main
Conversation
Summary: This PR adds a way to use fewer threads than the number configured in pthreadpool. Occasionally, using # of threads = # of logical cores is not efficient. This can be due to system load and various other factors that lead to threads being mapped to slower cores, or to fewer cores than the logical core count. This PR attempts to fix that.

Approach: Add an API that sets a thread-local variable specifying the # of threads to use. pthreadpool_parallelize will then distribute the work only among that many threads. Threads that are not picked continue to wait, via mutex/condvar, for the next chunk of work and thus give up their CPU slot. Both pthreads.c and windows.c are modified to add this feature.

In the original PR Maratyszcza#17, quite a few suggestions were made. Some fundamental ones remained unresolved, such as:
- using a thread-local num_threads_to_use vs. making this a property of the threadpool object itself
- waking up all the threads even when they don't participate in the work (this one definitely makes sense to fix)

It would make sense to resolve those in this iteration.

Test Plan: 4 tests are added to check this.
for (size_t tid = 1; tid < threads_count; tid++) {
  /* Lock the command variable to ensure that threads don't shutdown until
   * both command and active_threads are updated */
  pthread_mutex_lock(&threadpool->threads[tid].command_mutex);
This pattern (locking n mutexes in a loop) is very risky. If two threads try to do this at the same time and each is able to get some of the locks but not all of them, that is a deadlock. It seems like maybe if you always attempt to lock mutex i only after locking mutex i - 1, this is safe from that problem, but I'm not sure.
I'm also nervous about the performance impact of this change. Currently, pthreadpool is O(1) to distribute work to worker threads. This is fairly unique among thread pools. I've tried to change this in order to better accommodate multiple threads using pthreadpool simultaneously (currently pthreadpool serializes such use cases), but it's very difficult to do without very measurable performance regressions.
This pattern (locking n mutexes in a loop) is very risky. If two threads try to do this at the same time and each is able to get some of the locks but not all of them, that is a deadlock. It seems like maybe if you always attempt to lock mutex i only after locking mutex i - 1, this is safe from that problem, but I'm not sure.
Yeah, that's a really good point, and I agree with you about the specific pattern. I think what's different here is that the other threads that can contend for the lock are contending for their own mutex, not some other thread's. So what can happen, for example, is:
1. Main thread locks thread[1].command_mutex.
2. Thread 2 locks its own thread[2].command_mutex.
3. Main thread tries to lock thread[2].command_mutex.
In the above scenario, 2 and 3 can result in deadlock. However, thread 2 locks its command_mutex to observe whether a new command has arrived, and when that is not the case it enters a conditional wait via pthread_cond_wait, which unlocks the mutex and waits on the condvar in a blocking wait (spurious wakeups can happen, but they are accompanied by locking of the mutex). At that point 3 should succeed in acquiring the mutex.
I do agree with you that this requires careful consideration, so maybe it warrants more thought, but I suspect it should be ok.
I'm also nervous about the performance impact of this change. Currently, pthreadpool is O(1) to distribute work to worker threads. This is fairly unique among thread pools. I've tried to change this in order to better accommodate multiple threads using pthreadpool simultaneously (currently pthreadpool serializes such use cases), but it's very difficult to do without very measurable performance regressions.
Yeah, performance concerns are valid. I have kept the blocking nature of pthreadpool, but where perf concerns become clear is the spurious wakeups the current impl does: it wakes up non-participating threads just to have them go back to sleep. I do think we should think about how to fix that.
Wrt "accommodate multiple threads using pthreadpool simultaneously", I have honestly thought about this as well, but that is a significantly more intrusive design change to pthreadpool, so I have refrained from it. My way out on this has been "bring your own instance of pthreadpool": another thread wanting to parallelize its work has to use another instance of pthreadpool. But yeah, I don't have a good solution on this either, although this problem has definitely haunted me.
In the above scenario, 2 and 3 can result in deadlock. However, thread 2 locks its command_mutex to observe whether a new command has arrived, and when that is not the case it enters a conditional wait via pthread_cond_wait, which unlocks the mutex and waits on the condvar in a blocking wait (spurious wakeups can happen, but they are accompanied by locking of the mutex). At that point 3 should succeed in acquiring the mutex.
What if two threads are calling pthreadpool_parallelize at once? That's two threads trying to lock n mutexes (without unlocking) in a loop, isn't it?
I think this basically needs the deadlock avoidance algorithm found in std::lock, and I think that algorithm could require O(2^n) pthread_mutex_trylock calls to get n mutexes in the case of contention.
Doesn't

/* Protect the global threadpool structures */
pthread_mutex_lock(&threadpool->execution_mutex);

prevent what you are describing? That is, access to pthreadpool_parallelize itself is serialized.
I think this basically needs the deadlock avoidance algorithm found in std::lock, and I think that algorithm could require O(2^n) pthread_mutex_trylock calls to get n mutexes in the case of contention.
Ok, this I don't know a whole lot about, so I likely need to understand it better.
maybe if you always attempt to lock mutex i only after locking mutex i - 1, maybe this is safe from that problem, but I'm not sure.
I thought having a total order on locks was a general method to guarantee freedom from deadlock. I'm not super familiar with pthreadpool though, so of course I can't guarantee that there is no other place where some thread holds its own lock and then tries to acquire some other thread's lock, not necessarily in this order.
so of course I can't guarantee that there is no other place where some thread holds its own lock
This is my concern as well. pthreadpool is an unfortunately large codebase, and there definitely are other cases of calling pthread_mutex_lock in a way that might be safe, but also might not be, and it's difficult for me to tell. And even if it's safe now, it's a land mine for someone else to step on in the future...
it's a land mine for someone else to step on in the future
To be honest, the current implementation seems safe to me. But I understand your concerns about this being a land mine, and testing something like this is much more challenging. Then again, it also feels like you will only ever make incrementally safe changes to something that is widely deployed, which definitely makes sense.
One thing I probably want to follow up on is whether creating separate threadpools and switching between them offers a better alternative (measure the perf cost).
void pthreadpool_set_num_threads_to_use(size_t num_threads) {
  max_num_threads = num_threads;
}

size_t pthreadpool_get_num_threads_to_use() { return max_num_threads; }
It's confusing that we now have a thing on each threadpool called num_threads_to_use, but this is a separate, unrelated global thing also called num_threads_to_use.
* As per this change, this feature is not available in GCD based
* pthreadpool

* events and switch event in use after every submitted command according to
* the high bit of the command word.
*/
HANDLE command_event[2];