Lock continuation before calling event handler. by shinrich · Pull Request #4142 · apache/trafficserver

shinrich · 2018-08-21T21:34:33Z

After further deploying builds with the asserts in handleEvent we have seen several variants of the following stack. In the cases I've dug into the continuation called from reply_to_cont is a HttpSM. And indeed it is unlocked.

Grabbing the lock at the top of probeEvent and rescheduling if either lock cannot be obtained should solve the problem. To simply the lock scope problem, we get the action.mutex lock again if the continuation lock is not present. Should only be a compare and increment overhead.

 0  ] libc-2.17.so       __GI_raise                                                   ( :undefined         ) 
[ 1  ] libc-2.17.so       __GI_abort                                                   ( :undefined         ) 
[ 2  ] libtsutil.so.7     ink_abort                                                    ( ink_error.cc:99    ) 
[ 3  ] libtsutil.so.7     _ink_assert                                                  ( ink_assert.cc:37   ) 
[ 4  ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:32 ) 
[ 5  ] traffic_server     reply_to_cont                                                ( HostDB.cc:491      ) 
[ 6  ] traffic_server     HostDBContinuation::probeEvent(int, Event*)                  ( HostDB.cc:1736     ) 
[ 7  ] traffic_server     HostDBContinuation::dnsPendingEvent(int, Event*)             ( HostDB.cc:1193     ) 
[ 8  ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:33 ) 
[ 9  ] traffic_server     HostDBContinuation::remove_trigger_pending_dns()             ( HostDB.cc:1793     ) 
[ 10 ] traffic_server     HostDBContinuation::dnsEvent(int, HostEnt*)                  ( HostDB.cc:1487     ) 
[ 11 ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:33 ) 
[ 12 ] traffic_server     DNSEntry::postEvent(int, Event*)                             ( DNS.cc:1282        ) 
[ 13 ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:33 ) 
[ 14 ] traffic_server     EThread::process_event(Event*, int)                          ( UnixEThread.cc:132 ) 
[ 15 ] traffic_server     EThread::process_queue(Queue<Event, Event::Link_link>*, i... ( UnixEThread.cc:171 ) 
[ 16 ] traffic_server     EThread::execute_regular()                                   ( UnixEThread.cc:231 ) 
[ 17 ] traffic_server     EThread::execute()                                           ( UnixEThread.cc:326 ) 
[ 18 ] traffic_server     spawn_thread_internal                                        ( Thread.cc:85       ) 
[ 19 ] libpthread-2.17.so start_thread                                                 ( :undefined         )

vmamidi · 2018-08-22T03:58:57Z

when EThread::process_event(Event*, int) is called, it is not acquiring the HttpSM mutex? DNS/HostDB mutex should be same as that of HttpSM, right ?

SolidWallOfCode · 2018-08-22T13:17:58Z

No. The lock in HostDBContinuation should be the lock on the bucket in which the HostDB record resides, so that the continuation can updated it safely.

shinrich · 2018-08-22T14:18:47Z

In any case, this stack has the HttpSM mutex not held because the assert went off.

vmamidi · 2018-08-22T18:06:13Z

it makes sense to hold the bucket mutex. If the theory is right, we should be able to reproduce the scenario as it should crash if DNS takes longer than transaction time.

bryancall · 2018-08-28T16:42:59Z

  MUTEX_TRY_LOCK(lock, action.mutex, t);
-  if (!lock.is_locked()) {
+  // Go ahead and grab the continuation mutex or just grab the action mutex again of there is no continuation mutex
+  MUTEX_TRY_LOCK(lock2, (action.continuation && action.continuation->mutex) ? action.continuation->mutex : action.mutex, t);


Does this only happen with the global session pool? I was talking to @zwoop about this earlier today.

I have only tested with the global pool environment.

bryancall · 2019-03-27T22:33:11Z

Cherry picked to 8.1.0

Lock continuation before calling event handler.

e4b4c89

shinrich added the Core label Aug 21, 2018

shinrich added this to the 9.0.0 milestone Aug 21, 2018

shinrich self-assigned this Aug 21, 2018

SolidWallOfCode approved these changes Aug 27, 2018

View reviewed changes

shinrich mentioned this pull request Aug 27, 2018

Fix lost pending_actions causing actions on stale objects. #4166

Merged

bryancall reviewed Aug 28, 2018

View reviewed changes

shinrich merged commit 4aca609 into apache:master Sep 6, 2018

bryancall modified the milestones: 9.0.0, 8.1.0 Mar 27, 2019

zwoop modified the milestones: 8.1.0, 8.1.0-nogo Mar 18, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lock continuation before calling event handler.#4142

Lock continuation before calling event handler.#4142
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-another-missing-hostdb-lock

shinrich commented Aug 21, 2018

Uh oh!

vmamidi commented Aug 22, 2018

Uh oh!

SolidWallOfCode commented Aug 22, 2018

Uh oh!

shinrich commented Aug 22, 2018

Uh oh!

vmamidi commented Aug 22, 2018

Uh oh!

bryancall Aug 28, 2018

Uh oh!

shinrich Sep 6, 2018

Uh oh!

bryancall commented Mar 27, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

shinrich commented Aug 21, 2018

Uh oh!

vmamidi commented Aug 22, 2018

Uh oh!

SolidWallOfCode commented Aug 22, 2018

Uh oh!

shinrich commented Aug 22, 2018

Uh oh!

vmamidi commented Aug 22, 2018

Uh oh!

bryancall Aug 28, 2018

Choose a reason for hiding this comment

Uh oh!

shinrich Sep 6, 2018

Choose a reason for hiding this comment

Uh oh!

bryancall commented Mar 27, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants