
API to specify number of threads, from threadpool, to use for the task#32

Open
kimishpatel wants to merge 1 commit into google:main from kimishpatel:use_subset_threadpool

Conversation

@kimishpatel

Summary:
This PR adds a way to use fewer threads than configured within pthreadpool. Occasionally it has been seen that using a number of threads equal to the number of logical cores is not efficient. This can be due to system load and various other factors that lead to threads either being mapped to slower cores or being mapped to fewer cores than the logical core count.
This PR attempts to fix this.

Approach:

Add an API to set a thread-local variable specifying the number of threads to use. pthreadpool_parallelize will then distribute the work only among the specified number of threads.
Threads that are not picked continue to wait, likely via mutex/condvar, for the next chunk of work and thus give up their CPU slot. Both pthreads.c and windows.c are modified to add this feature.

In the original PR Maratyszcza#17, quite a few suggestions were made. Some fundamental ones remained unresolved, such as:

  • using a thread-local num_threads_to_use vs. making this a property of the threadpool object itself;
  • waking up all the threads even when they don't participate in the work (fixing this one definitely makes sense).

It would make sense to resolve those in this iteration.

Test Plan:
4 tests are added to check this.

Reviewers:

Subscribers:

Tasks:

Tags:

for (size_t tid = 1; tid < threads_count; tid++) {
/* Lock the command variable to ensure that threads don't shutdown until
* both command and active_threads are updated */
pthread_mutex_lock(&threadpool->threads[tid].command_mutex);
Reviewer:
This pattern (locking n mutexes in a loop) is very risky. If two threads are trying to do this at the same time, and each are able to get some of the locks but not all of them, this is a deadlock. It seems like maybe if you always attempt to lock mutex i only after locking mutex i - 1, maybe this is safe from that problem, but I'm not sure.

I'm also nervous about the performance impact of this change. Currently, pthreadpool is O(1) to distribute work to worker threads. This is fairly unique among thread pools. I've tried to change this in order to better accommodate multiple threads using pthreadpool simultaneously (currently pthreadpool serializes such use cases), but it's very difficult to do without very measurable performance regressions.

Author (@kimishpatel):

> This pattern (locking n mutexes in a loop) is very risky. If two threads are trying to do this at the same time, and each are able to get some of the locks but not all of them, this is a deadlock. It seems like maybe if you always attempt to lock mutex i only after locking mutex i - 1, maybe this is safe from that problem, but I'm not sure.

Yeah, that's a really good point, and I agree with you about the specific pattern. I think what's different here is that the other threads that can contend for the lock are contending for their own mutex, not some other thread's. So what can happen, for example, is:

  1. Main thread locks thread[1].command_mutex.
  2. Thread 2 locks its own thread[2].command_mutex.
  3. Main thread tries to lock thread[2].command_mutex.

In the above scenario, 2 and 3 can result in deadlock. However, thread 2 locks its command_mutex to observe whether a new command has arrived, and when that is not the case it enters a conditional wait via pthread_cond_wait, which unlocks the mutex and blocks on the condvar (spurious wakeups can happen, but those are accompanied by re-locking of the mutex). At that point, 3 should succeed in acquiring the mutex.

I do agree with you that this requires careful consideration, so maybe it warrants more of a thought exercise, but I suspect it should be ok.

> I'm also nervous about the performance impact of this change. Currently, pthreadpool is O(1) to distribute work to worker threads. This is fairly unique among thread pools. I've tried to change this in order to better accommodate multiple threads using pthreadpool simultaneously (currently pthreadpool serializes such use cases), but it's very difficult to do without very measurable performance regressions.

Yeah, the performance concerns are valid. I have kept the blocking nature of pthreadpool, but where the perf concerns become clear is in the spurious wakeups the current impl does: it wakes up non-participating threads just to have them go back to sleep. I do think we should think about how to fix that.

W.r.t. "accommodate multiple threads using pthreadpool simultaneously", I have honestly thought about this as well, but that is a significantly more intrusive design change to pthreadpool, so I have refrained from it. My way out on this has been "bring your own instance of pthreadpool": another thread wanting to parallelize its work will have to use another instance of pthreadpool. But yeah, I don't have a good solution for this either, although this problem has definitely haunted me.

Reviewer:

> In the above scenario 2 and 3 can result in deadlock. However thread 2 locks its command_mutex to observe if new command has arrived and when that is not the case it enters conditional wait via pthread_cond_wait which unlocks the mutex and waits on the condvar which is a blocking wait (spurious wakeups can happen but that would accompany locking of the mutex). At that point 3 should succeed in acquiring the mutex.

What if two threads are calling pthreadpool_parallelize at once? That's two threads trying to lock n mutexes (without unlocking) in a loop, isn't it?

I think this basically needs the deadlock avoidance algorithm found in std::lock, and I think that algorithm could require O(2^n) pthread_mutex_trylock calls to get n mutexes in the case of contention.

Author (@kimishpatel):

Doesn't

  /* Protect the global threadpool structures */
  pthread_mutex_lock(&threadpool->execution_mutex);

prevent what you are describing? That is, access to pthreadpool_parallelize itself is serialized.

> I think this basically needs the deadlock avoidance algorithm found in std::lock, and I think that algorithm could require O(2^n) pthread_mutex_trylock calls to get n mutexes in the case of contention.

Ok, this I don't know a whole lot about, so I likely need to understand it better.

Reviewer:

> maybe if you always attempt to lock mutex i only after locking mutex i - 1, maybe this is safe from that problem, but I'm not sure.

I thought having a total order on locks was a general method to guarantee freedom from deadlock. I'm not super familiar with pthreadpool though, so of course I can't guarantee that there is no other place where some thread holds its own lock and then tries to acquire some other thread's lock, not necessarily in this order.

Reviewer:

> so of course I can't guarantee that there is no other place where some thread holds its own lock

This is my concern as well. pthreadpool is an unfortunately large codebase, and there definitely are other cases of calling pthread_mutex_lock in a way that might be safe, but also might not be, and it's difficult for me to tell. And even if it's safe now, it's a land mine for someone else to step on in the future...

Author (@kimishpatel):

> it's a land mine for someone else to step on in the future

To be honest, the current implementation seems safe to me. But I understand your concerns about this being a land mine, and testing something like this is much more challenging. Then again, it also feels like you will only ever make incrementally safe changes to something that is widely deployed, which definitely makes sense.

One thing I will probably want to follow up on is whether creating separate threadpools and switching between them offers a better alternative (and measuring the perf cost).

Comment on lines +339 to +343
void pthreadpool_set_num_threads_to_use(size_t num_threads) {
max_num_threads = num_threads;
}

size_t pthreadpool_get_num_threads_to_use() { return max_num_threads; }
Reviewer:
It's confusing that we now have a thing on each threadpool called num_threads_to_use, but this is a separate, unrelated global thing also called num_threads_to_use.

Comment on lines +1088 to +1089
* As per this change, this feature is not available in GCD based
* pthreadpool
Reviewer:
per what change?

* events and switch event in use after every submitted command according to
* the high bit of the command word.
*/
HANDLE command_event[2];
Reviewer:
ditto
