Skip to content

Lock continuation before calling event handler.#4142

Merged
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-another-missing-hostdb-lock
Sep 6, 2018
Merged

Lock continuation before calling event handler.#4142
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-another-missing-hostdb-lock

Conversation

@shinrich
Copy link
Copy Markdown
Member

After further deploying builds with the asserts in handleEvent we have seen several variants of the following stack. In the cases I've dug into the continuation called from reply_to_cont is a HttpSM. And indeed it is unlocked.

Grabbing the lock at the top of probeEvent and rescheduling if either lock cannot be obtained should solve the problem. To simply the lock scope problem, we get the action.mutex lock again if the continuation lock is not present. Should only be a compare and increment overhead.

 0  ] libc-2.17.so       __GI_raise                                                   ( :undefined         ) 
[ 1  ] libc-2.17.so       __GI_abort                                                   ( :undefined         ) 
[ 2  ] libtsutil.so.7     ink_abort                                                    ( ink_error.cc:99    ) 
[ 3  ] libtsutil.so.7     _ink_assert                                                  ( ink_assert.cc:37   ) 
[ 4  ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:32 ) 
[ 5  ] traffic_server     reply_to_cont                                                ( HostDB.cc:491      ) 
[ 6  ] traffic_server     HostDBContinuation::probeEvent(int, Event*)                  ( HostDB.cc:1736     ) 
[ 7  ] traffic_server     HostDBContinuation::dnsPendingEvent(int, Event*)             ( HostDB.cc:1193     ) 
[ 8  ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:33 ) 
[ 9  ] traffic_server     HostDBContinuation::remove_trigger_pending_dns()             ( HostDB.cc:1793     ) 
[ 10 ] traffic_server     HostDBContinuation::dnsEvent(int, HostEnt*)                  ( HostDB.cc:1487     ) 
[ 11 ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:33 ) 
[ 12 ] traffic_server     DNSEntry::postEvent(int, Event*)                             ( DNS.cc:1282        ) 
[ 13 ] traffic_server     Continuation::handleEvent(int, void*)                        ( Continuation.cc:33 ) 
[ 14 ] traffic_server     EThread::process_event(Event*, int)                          ( UnixEThread.cc:132 ) 
[ 15 ] traffic_server     EThread::process_queue(Queue<Event, Event::Link_link>*, i... ( UnixEThread.cc:171 ) 
[ 16 ] traffic_server     EThread::execute_regular()                                   ( UnixEThread.cc:231 ) 
[ 17 ] traffic_server     EThread::execute()                                           ( UnixEThread.cc:326 ) 
[ 18 ] traffic_server     spawn_thread_internal                                        ( Thread.cc:85       ) 
[ 19 ] libpthread-2.17.so start_thread                                                 ( :undefined         )

@shinrich shinrich added the Core label Aug 21, 2018
@shinrich shinrich added this to the 9.0.0 milestone Aug 21, 2018
@shinrich shinrich self-assigned this Aug 21, 2018
@vmamidi
Copy link
Copy Markdown
Contributor

vmamidi commented Aug 22, 2018

when EThread::process_event(Event*, int) is called, it is not acquiring the HttpSM mutex? DNS/HostDB mutex should be same as that of HttpSM, right ?

@SolidWallOfCode
Copy link
Copy Markdown
Member

No. The lock in HostDBContinuation should be the lock on the bucket in which the HostDB record resides, so that the continuation can updated it safely.

@shinrich
Copy link
Copy Markdown
Member Author

In any case, this stack has the HttpSM mutex not held because the assert went off.

@vmamidi
Copy link
Copy Markdown
Contributor

vmamidi commented Aug 22, 2018

it makes sense to hold the bucket mutex. If the theory is right, we should be able to reproduce the scenario as it should crash if DNS takes longer than transaction time.

Comment thread iocore/hostdb/HostDB.cc
MUTEX_TRY_LOCK(lock, action.mutex, t);
if (!lock.is_locked()) {
// Go ahead and grab the continuation mutex or just grab the action mutex again of there is no continuation mutex
MUTEX_TRY_LOCK(lock2, (action.continuation && action.continuation->mutex) ? action.continuation->mutex : action.mutex, t);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this only happen with the global session pool? I was talking to @zwoop about this earlier today.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have only tested with the global pool environment.

@shinrich shinrich merged commit 4aca609 into apache:master Sep 6, 2018
@bryancall bryancall modified the milestones: 9.0.0, 8.1.0 Mar 27, 2019
@bryancall
Copy link
Copy Markdown
Contributor

Cherry picked to 8.1.0

@zwoop zwoop modified the milestones: 8.1.0, 8.1.0-nogo Mar 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants