Skip to content

Conversation

@mattaezell
Copy link
Contributor

Valgrind has a thread error detection tool called Helgrind. This series adds special support to ignore known-safe threading issues and also fixes several non-safe threading problems.

Helgrind is usable without any special support, but adding explicit
support allows known-safe data races or lock order violations to be
ignored.

Make sure you have the Valgrind header file 'helgrind.h' and configure
using '--with-helgrind'.  Set the environment variable PBSDEBUG=1 and
then run:
valgrind --tool=helgrind pbs_server

This commit also tells Helgrind to ignore three known data races on
global variables:
- LOGLEVEL
- pbs_tcp_timeout
- last_task_check_time
req_selectjobs() calls build_selist(), which returns an unlocked queue.
req_selectjobs() proceeds as if the queue is locked, and then tries to
unlock the already-unlocked queue.  Since req_selectjobs() is the only
caller to build_selist(), have it return the queue locked.
Previously, this lock was protecting the line:
  sock = connection[con].ch_socket;
but sock was unused and the line was removed. No need to keep the mutex
lock/unlock in place.
The proper lock order for tasks is to acquire the alltasks mutex and
then the mutex for the specific task.
When a task is created, its mutex is locked. Then, trying to add
it to the proper task list acquires the alltasks mutex, which is an
ordering violation. This can't cause a deadlock, because nobody else
would be waiting on that task's mutex. Move the acquisition of the task
mutex to the functions that actually do the insertion. Now, these
functions expect the task unlocked and return it locked.
Also, there are a couple "trylock" situations that violate lock order,
but fall back to correct behavior if they cannot acquire the lock.
Ignore the trylocks when using Helgrind.
The functions lock_startup(), lock_conn_table(), and lock_ss()
effectively did nothing.  This led to potential data races and unlocking
non-locked mutexes.
Newly created queues are locked prior to being inserted into a global
queue hash. This violates the lock order of allqueues => queue. Change so
the proper order is observed.
There are several places where lock order is violated by calling
pthread_mutex_trylock(). Deadlock cannot occur, so have Helgrind ignore
these situations.
@dhh1128
Copy link
Contributor

dhh1128 commented Dec 14, 2012

Matt: Thanks for submitting this. I'll spend some time seeing if I can get it merged in the next couple work days.

@ghost ghost assigned knielson Dec 17, 2012
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matt: I am concerned that when HELGRIND is defined, we don't lock at all. It seems like instead of turning off the line that does locks, we'd want to get the order right. Do you agree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pthread_mutex_trylock() will return non-zero immediately if it cannot acquire the lock (ie, another thread is already holding it). This could occur if the thread was deadlocked due to lock order violations, or just because another thread was busy using it. Either way, if it can't acquire the lock immediately, it drops the lock it already has. It then acquires them in "correct" order.

The IFNDEF simulates pthread_mutex_trylock() returning non-zero (ie, pretend it couldn't acquire the lock and always enter the conditional) so it forces it to use correct order.

Using the pthread_mutex_trylock() is an "optimization" for the usual case. There's no need to drop a lock and reacquire it unless it would otherwise deadlock. I'm fine if we want the policy to be you ALWAYS have to lock in the correct order, but I think it's fine as-is.

@knielson knielson closed this Feb 27, 2013
knielson added a commit that referenced this pull request Sep 9, 2015
This fixes issue #2. add_to_completed_jobs was not
calling free for the task structure.
knielson added a commit that referenced this pull request Sep 9, 2015
This fixes issue #2. add_to_completed_jobs was not
calling free for the task structure.

Conflicts:
	src/server/completed_jobs_map.cpp
knielson added a commit that referenced this pull request Sep 9, 2015
This fixes issue #2. add_to_completed_jobs was not
calling free for the task structure.
knielson added a commit that referenced this pull request Sep 9, 2015
This fixes issue #2. add_to_completed_jobs was not
calling free for the task structure.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants