Threading fixes #2

mattaezell · 2012-12-09T23:54:22Z

Valgrind has a thread error detection tool called Helgrind. This series adds special support to ignore known-safe threading issues and also fixes several non-safe threading problems.

Helgrind is usable without any special support, but adding explicit support allows known-safe data races or lock order violations to be ignored. Make sure you have the Valgrind header file 'helgrind.h' and configure using '--with-helgrind'. Set the environment variable PBSDEBUG=1 and then run:

    valgrind --tool=helgrind pbs_server

This commit also tells Helgrind to ignore three known data races on global variables:
- LOGLEVEL
- pbs_tcp_timeout
- last_task_check_time
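As a rough illustration of what "explicit support" can look like, the sketch below marks one global as exempt from race checking. It assumes the build defines a `HELGRIND` macro when configured with '--with-helgrind' and uses the `VALGRIND_HG_DISABLE_CHECKING` client request from Valgrind's helgrind.h; the variable and function names are hypothetical stand-ins, and the actual commit may use a different annotation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the build defines HELGRIND when configured with
 * --with-helgrind. Without it, the annotation compiles to nothing. */
#ifdef HELGRIND
#include <valgrind/helgrind.h>
#else
#define VALGRIND_HG_DISABLE_CHECKING(addr, len) ((void)0)
#endif

/* Hypothetical stand-in for a global such as LOGLEVEL that is read and
 * written by several threads without a lock. */
static int example_loglevel = 0;

/* Intended to be called once at startup, before worker threads exist.
 * Tells Helgrind not to report races on this address range. */
static void ignore_known_races(void)
{
    VALGRIND_HG_DISABLE_CHECKING(&example_loglevel, sizeof(example_loglevel));
}

static void set_loglevel(int level)
{
    assert(level >= 0);  /* illustrative sanity check only */
    example_loglevel = level;
}

static int get_loglevel(void)
{
    return example_loglevel;
}
```

The annotation only silences the report; it does not make the access atomic, so it is appropriate only for variables where a stale read is harmless.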

req_selectjobs() calls build_selist(), which returns an unlocked queue. req_selectjobs() proceeds as if the queue is locked, and then tries to unlock the already-unlocked queue. Since req_selectjobs() is the only caller to build_selist(), have it return the queue locked.
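The contract described above (the helper returns the queue locked, and the caller unlocks it exactly once) can be sketched as follows. The struct and function names are hypothetical and much simpler than the real build_selist() and req_selectjobs().

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified queue; the real TORQUE structure differs. */
struct queue {
    pthread_mutex_t mutex;
    int job_count;
};

/* Returns the queue with its mutex held, or NULL with nothing locked. */
static struct queue *select_queue_locked(struct queue *candidate)
{
    int rc;

    if (candidate == NULL)
        return NULL;
    rc = pthread_mutex_lock(&candidate->mutex);
    assert(rc == 0);
    (void)rc;
    return candidate;
}

/* The only caller: relies on the "returned locked" contract and
 * releases the mutex exactly once. Returns -1 if no queue was selected. */
static int count_jobs(struct queue *candidate)
{
    struct queue *q = select_queue_locked(candidate);
    int n;

    if (q == NULL)
        return -1;
    n = q->job_count;
    pthread_mutex_unlock(&q->mutex);
    return n;
}
```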

Previously, this lock was protecting the line: sock = connection[con].ch_socket; but sock was unused and the line was removed. No need to keep the mutex lock/unlock in place.

The proper lock order for tasks is to acquire the alltasks mutex and then the mutex for the specific task. When a task is created, its mutex is locked. Then, trying to add it to the proper task list acquires the alltasks mutex, which is an ordering violation. This can't cause a deadlock, because nobody else would be waiting on that task's mutex. Move the acquisition of the task mutex to the functions that actually do the insertion. Now, these functions expect the task unlocked and return it locked. Also, there are a couple "trylock" situations that violate lock order, but fall back to correct behavior if they cannot acquire the lock. Ignore the trylocks when using Helgrind.
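A minimal sketch of the "expects the task unlocked, returns it locked" insertion described above. The types and the function name are hypothetical simplifications; only the ordering (list mutex first, then the task mutex) is taken from the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified types; the real TORQUE structures differ. */
struct task {
    pthread_mutex_t mutex;
    struct task *next;
    int id;
};

struct task_list {
    pthread_mutex_t mutex;  /* plays the role of the alltasks mutex */
    struct task *head;
};

/* Expects `t` unlocked; returns with t->mutex held.
 * Lock order: list mutex first, then the task's own mutex. */
static void insert_task(struct task_list *list, struct task *t)
{
    int rc = pthread_mutex_lock(&list->mutex);
    assert(rc == 0);
    rc = pthread_mutex_lock(&t->mutex);
    assert(rc == 0);

    t->next = list->head;
    list->head = t;

    /* Release only the list mutex; the caller gets the task locked. */
    rc = pthread_mutex_unlock(&list->mutex);
    assert(rc == 0);
    (void)rc;
}
```

Because the task mutex is first taken inside the insertion function, the creating thread never holds it while waiting for the list mutex, so Helgrind sees a single consistent order.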

The functions lock_startup(), lock_conn_table(), and lock_ss() effectively did nothing. This led to potential data races and unlocking non-locked mutexes.

Newly created queues are locked prior to being inserted into a global queue hash. This violates the lock order of allqueues => queue. Change so the proper order is observed.

There are several places where lock order is violated by calling pthread_mutex_trylock(). Deadlock cannot occur, so have Helgrind ignore these situations.

dhh1128 · 2012-12-14T20:32:15Z

Matt: Thanks for submitting this. I'll spend some time seeing if I can get it merged in the next couple work days.

dhh1128 · 2012-12-17T22:22:01Z

src/server/array_func.c

Matt: I am concerned that when HELGRIND is defined, we don't lock at all. It seems like instead of turning off the line that does locks, we'd want to get the order right. Do you agree?

mattaezell · Dec 17, 2012

pthread_mutex_trylock() returns non-zero immediately if it cannot acquire the lock (i.e., another thread is already holding it). That can happen where a blocking lock would have deadlocked because of a lock order violation, or simply because another thread is busy using the mutex. Either way, if the code can't acquire the lock immediately, it drops the lock it already holds and then acquires both in the "correct" order.

The #ifndef simulates pthread_mutex_trylock() returning non-zero (i.e., it pretends the lock couldn't be acquired and always enters the conditional), so it forces the code to use the correct order.

Using pthread_mutex_trylock() is an "optimization" for the usual case. There's no need to drop a lock and reacquire it unless it would otherwise deadlock. I'm fine if we want the policy to be that you ALWAYS have to lock in the correct order, but I think it's fine as-is.
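The pattern under discussion can be sketched as below. The function and parameter names are hypothetical, and the `HELGRIND` macro is assumed to be the one the build defines; the code in array_func.c is more involved.

```c
#include <assert.h>
#include <pthread.h>

/* Required order is `outer` before `inner`. The caller already holds
 * `inner`; on return both mutexes are held.
 * Returns 1 if the trylock fast path was used, 0 if the fallback ran. */
static int lock_outer_while_holding_inner(pthread_mutex_t *outer,
                                          pthread_mutex_t *inner)
{
    int rc;

#ifndef HELGRIND
    /* Fast path: out-of-order, but trylock never blocks, so it
     * cannot deadlock. Compiled out when building for Helgrind. */
    if (pthread_mutex_trylock(outer) == 0)
        return 1;
#endif

    /* Fallback (always taken under Helgrind): drop the lock we hold
     * and reacquire both in the correct order. Anything protected by
     * `inner` may have changed in between, so callers must re-check it. */
    rc = pthread_mutex_unlock(inner);
    assert(rc == 0);
    rc = pthread_mutex_lock(outer);
    assert(rc == 0);
    rc = pthread_mutex_lock(inner);
    assert(rc == 0);
    (void)rc;
    return 0;
}
```

Under Helgrind the fast path disappears entirely, so the tool only ever observes the correct order; in a normal build the fallback runs only when the mutex is contended.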

mattaezell added 7 commits December 8, 2012 16:05

Remove unnecessary locking/unlocking in send_job_work()

c8ebe78

Previously, this lock was protecting the line: sock = connection[con].ch_socket; but sock was unused and the line was removed. No need to keep the mutex lock/unlock in place.

Fix lock_startup(), lock_conn_table(), and lock_ss() to actually lock

4c72503

The functions lock_startup(), lock_conn_table(), and lock_ss() effectively did nothing. This led to potential data races and unlocking non-locked mutexes.

Observe correct lock order when creating queues

38f5d54

Newly created queues are locked prior to being inserted into a global queue hash. This violates the lock order of allqueues => queue. Change so the proper order is observed.

Have Helgrind ignore pthread_mutex_trylock() lock order violations

2af2297

There are several places where lock order is violated by calling pthread_mutex_trylock(). Deadlock cannot occur, so have Helgrind ignore these situations.

ghost assigned knielson Dec 17, 2012

dhh1128 reviewed Dec 17, 2012

dbeer referenced this pull request in dbeer/torque Jan 22, 2013

Merge pull request actorquedeveloper#2 from adaptivecomputing/4.1-dev

89f4950

4.1 dev

knielson closed this Feb 27, 2013

knielson added a commit that referenced this pull request Sep 9, 2015

TRQ-3236 Valgrind memory leaks

61491b3

This fixes issue #2. add_to_completed_jobs was not calling free for the task structure.

knielson added a commit that referenced this pull request Sep 9, 2015

TRQ-3236 Valgrind memory leaks

0411d6f

This fixes issue #2. add_to_completed_jobs was not calling free for the task structure. Conflicts: src/server/completed_jobs_map.cpp

knielson added a commit that referenced this pull request Sep 9, 2015

TRQ-3236 Valgrind memory leaks

73de711

This fixes issue #2. add_to_completed_jobs was not calling free for the task structure.

knielson added a commit that referenced this pull request Sep 9, 2015

TRQ-3236 Valgrind memory leaks

cbc3eea

This fixes issue #2. add_to_completed_jobs was not calling free for the task structure.

widyono-cets mentioned this pull request Mar 28, 2017

pbs_server segmentation fault triggered by qdel #421

Closed

mattmix mentioned this pull request May 19, 2017

Incorrectly formated resource request crashes pbs_server #425

Open
