-
Notifications
You must be signed in to change notification settings - Fork 142
Threading fixes #2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Helgrind is usable without any special support, but adding explicit support allows known-safe data races or lock order violations to be ignored. Make sure you have the Valgrind header file 'helgrind.h' and configure using '--with-helgrind'. Set the environment variable PBSDEBUG=1 and then run: valgrind --tool=helgrind pbs_server This commit also tells Helgrind to ignore three known data races on global variables: - LOGLEVEL - pbs_tcp_timeout - last_task_check_time
req_selectjobs() calls build_selist(), which returns an unlocked queue. req_selectjobs() proceeds as if the queue is locked, and then tries to unlock the already-unlocked queue. Since req_selectjobs() is the only caller to build_selist(), have it return the queue locked.
Previously, this lock was protecting the line: sock = connection[con].ch_socket; but sock was unused and the line was removed. No need to keep the mutex lock/unlock in place.
The proper lock order for tasks is to acquire the alltasks mutex and then the mutex for the specific task. When a task is created, its mutex is locked. Then, trying to add it to the proper task list acquires the alltasks mutex, which is an ordering violation. This can't cause a deadlock, because nobody else would be waiting on that task's mutex. Move the acquisition of the task mutex to the functions that actually do the insertion. Now, these functions expect the task unlocked and return it locked. Also, there are a couple "trylock" situations that violate lock order, but fall back to correct behavior if they cannot acquire the lock. Ignore the trylocks when using Helgrind.
The functions lock_startup(), lock_conn_table(), and lock_ss() effectively did nothing. This led to potential data races and unlocking non-locked mutexes.
Newly created queues are locked prior to being inserted into a global queue hash. This violates the lock order of allqueues => queue. Change so the proper order is observed.
There are several places where lock order is violated by calling pthread_mutex_trylock(). Deadlock cannot occur, so have Helgrind ignore these situations.
|
Matt: Thanks for submitting this. I'll spend some time seeing if I can get it merged in the next couple work days. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Matt: I am concerned that when HELGRIND is defined, we don't lock at all. It seems like instead of turning off the line that does locks, we'd want to get the order right. Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pthread_mutex_trylock() will return non-zero immediately if it cannot acquire the lock (ie, another thread is already holding it). This could occur if the thread was deadlocked due to lock order violations, or just because another thread was busy using it. Either way, if it can't acquire the lock immediately, it drops the lock it already has. It then acquires them in "correct" order.
The IFNDEF simulates pthread_mutex_trylock() returning non-zero (ie, pretend it couldn't acquire the lock and always enter the conditional) so it forces it to use correct order.
Using the pthread_mutex_trylock() is an "optimization" for the usual case. There's no need to drop a lock and reacquire it unless it would otherwise deadlock. I'm fine if we want the policy to be you ALWAYS have to lock in the correct order, but I think it's fine as-is.
This fixes issue #2. add_to_completed_jobs was not calling free for the task structure.
This fixes issue #2. add_to_completed_jobs was not calling free for the task structure. Conflicts: src/server/completed_jobs_map.cpp
This fixes issue #2. add_to_completed_jobs was not calling free for the task structure.
This fixes issue #2. add_to_completed_jobs was not calling free for the task structure.
Valgrind has a thread error detection tool called Helgrind. This series adds special support to ignore known-safe threading issues and also fixes several non-safe threading problems.