Skip to content

Fix lost pending_actions causing actions on stale objects.#4166

Merged
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-stale-hostdb-actions
Sep 20, 2018
Merged

Fix lost pending_actions causing actions on stale objects.#4166
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-stale-hostdb-actions

Conversation

@shinrich
Copy link
Copy Markdown
Member

@shinrich shinrich commented Aug 27, 2018

We have been discovering and fixing places where hostdb was invoking handlers against HttpSM without a lock.

As we have been running this fix in production, our set of crashes has changed. Specifically we were seeing HOSTDB_LOOKUP events showing up in HttpSM::state_http_server_open and causing an unexpected event assert. I think these changes were pushing crashes down the pipe. Looking more closely at the cores, the HOSTDB_LOOKUP event was being invoked against a stale HttpSM. Looking further down the stack, the hostname that DNS was using was different that the server hostname referenced in HttpSM. So it seems that there was a pending hostdb action that wasn't canceled before the origin HttpSM object was destroyed. The other exciting observation was that the cores seemed to happen in groups across machines implying there was some odd off box actions affecting the ATS flow.

Here is an example of that stack.

#0  0x00002b42cfc661d7 in raise () from /lib64/libc.so.6
#1  0x00002b42cfc678c8 in abort () from /lib64/libc.so.6
#2  0x00002b42cd183277 in ink_abort (message_format=0x2b42cd1ac080 "%s:%d: failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b42cd1803c7 in _ink_assert (expression=0x864dc7 "0", file=0x864d98 "../../../../trafficserver/proxy/http/HttpSM.cc", line=1941) at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x0000000000657bf7 in HttpSM::state_http_server_open (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:1941
#5  0x000000000065aec1 in HttpSM::main_handler (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:2802
#6  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000075101f in reply_to_cont (cont=0x2b4331c64400, r=0x13ccc00, is_srv=false) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:491
#8  0x0000000000755886 in HostDBContinuation::dnsEvent (this=0x2b42cd0fd9d0, event=600, e=0x2b435c127440) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:1470
#9  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42cd0fd9d0, event=600, data=0x2b435c127440) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#10 0x00000000007687f6 in DNSEntry::postEvent (this=0x2b42f01bce00) at ../../../../trafficserver/iocore/dns/DNS.cc:1282
#11 0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42f01bce00, event=1, data=0x2b4333faa6e0) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#12 0x000000000083312b in EThread::process_event (this=0x2b42d26bb010, e=0x2b4333faa6e0, calling_code=1) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:132
#13 0x0000000000833302 in EThread::process_queue (this=0x2b42d26bb010, NegativeQueue=0x2b42f54a3e10, ev_count=0x2b42f54a3e08, nq_count=0x2b42f54a3e0c) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:171
#14 0x00000000008335ca in EThread::execute_regular (this=0x2b42d26bb010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:231
#15 0x0000000000833949 in EThread::execute (this=0x2b42d26bb010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#16 0x000000000083257e in spawn_thread_internal (a=0x1247070) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#17 0x00002b42ceff9dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00002b42cfd2876d in clone () from /lib64/libc.so.6

I created another build that added ink_release_asserts on pending_action in HttpSM::state_http_server_open. The assertion is that pending_action should be nullptr before reassigning it. If you reassign a pending_action that is not null, you lose the reference to that action and you won't be able to cancel that action when HttpSM shuts down. From that build, we were able to get core with the original HttpSM. Here is an example core.

#0  0x00002b4c445cb1d7 in raise () from /lib64/libc.so.6
#1  0x00002b4c445cc8c8 in abort () from /lib64/libc.so.6
#2  0x00002b4c41ae8277 in ink_abort (message_format=0x2b4c41b11080 "%s:%d: failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b4c41ae53c7 in _ink_assert (expression=0x865780 "event == EVENT_INTERVAL || pending_action == nullptr", file=0x864c38 "../../../../trafficserver/proxy/http/HttpSM.cc", line=1833)
    at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x00000000006570c5 in HttpSM::state_http_server_open (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:1833
#5  0x000000000065ad79 in HttpSM::main_handler (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:2799
#6  0x0000000000830eb9 in Continuation::handleEvent (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000080de66 in read_signal_and_update (event=3, vc=0x2b4e64f22e40) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:144
#8  0x000000000080e1e1 in read_signal_done (event=3, nh=0x2b4c47998cf0, vc=0x2b4e64f22e40) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:205
#9  0x000000000080e299 in read_signal_error (nh=0x2b4c47998cf0, vc=0x2b4e64f22e40, lerrno=0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:229
#10 0x0000000000810eb3 in UnixNetVConnection::readSignalError (this=0x2b4e64f22e40, nh=0x2b4c47998cf0, err=0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:1090
#11 0x00000000007e8617 in SSLNetVConnection::net_read_io (this=0x2b4e64f22e40, nh=0x2b4c47998cf0, lthread=0x2b4c47995010) at ../../../../trafficserver/iocore/net/SSLNetVConnection.cc:541
#12 0x0000000000805f4f in NetHandler::waitForActivity (this=0x2b4c47998cf0, timeout=60000000) at ../../../../trafficserver/iocore/net/UnixNet.cc:497
#13 0x0000000000833646 in EThread::execute_regular (this=0x2b4c47995010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:273
#14 0x000000000083380d in EThread::execute (this=0x2b4c47995010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#15 0x0000000000832442 in spawn_thread_internal (a=0x120bee0) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#16 0x00002b4c4395edc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00002b4c4468d76d in clone () from /lib64/libc.so.6

In this case, the last bit of history on the HttpSM is interesting.

{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 200, reentrancy = 2}, 
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 105, reentrancy = 1}, 
{fileline = 0x868a28 "../../../../trafficserver/proxy/http/HttpSM.cc:7745", event = 65535, reentrancy = 1},
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 3, reentrancy = 1},

The first line says that we called state_http_server_open with a NET_OPEN event. Basically ATS has sent the SYN to the origin.

The second line says that state_http_server_open was called with an inactivity timeout error.

The third line is setting the default handler to HttpSM::state_mark_os_down.

But the fourth line is back at state_http_server_open with a EVENT_ERROR. Which is odd because we have already done the error stuff for that socket.

Looking into the code, the error case will try additional addresses for the origin. This may just pick up the next address in the cached record, but if we are unlucky the cached DNS info may have expired in the 30-40 seconds since we started. If we go through the error case again, we may set up two dns requests for the same state machine. It is quite possible that the state machine will be long gone by the time the second DNS request goes off. If we have overwritten the pending_action at the start of HttpSM::state_http_server_open, there is no way to cancel the event before the state machine goes away.

To address this issue and avoid the double error event, the right thing to do is call release_server_session() in the error case of HttpSM::state_http_server_open. If it session has read no response, that function will call do_io_close on the server vc (freeing the backing memory) and remove the related server_entry structure and epoll entries.

We have been running this code in one colo since Thursday night and have had no crashes. Where before we were getting crashes very regularly during high traffic time (presumably because of overloaded origins).

@shinrich shinrich added this to the 9.0.0 milestone Aug 27, 2018
@shinrich shinrich self-assigned this Aug 27, 2018
zizhong
zizhong previously approved these changes Aug 27, 2018
Comment thread proxy/http/HttpSM.cc
// then the value of kill_this_async_done has changed so
// we must check it again
if (kill_this_async_done == true) {
if (pending_action) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense according to the backtraces.
might remove line 6854 as well since it's useless now.

Copy link
Copy Markdown
Contributor

@zwoop zwoop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think zizhong's suggestion should be done, leaving that assert would be confusing.

@shinrich
Copy link
Copy Markdown
Member Author

shinrich commented Sep 6, 2018

Good point @zizhong. Updated to remove the useless assert.

@shinrich
Copy link
Copy Markdown
Member Author

@zizhong @zwoop look ok now?

zwoop
zwoop previously approved these changes Sep 10, 2018
zizhong
zizhong previously approved these changes Sep 10, 2018
@shinrich shinrich dismissed stale reviews from zizhong and zwoop via a1ba929 September 10, 2018 18:16
@shinrich shinrich force-pushed the fix-stale-hostdb-actions branch from a1ba929 to 25ca497 Compare September 10, 2018 20:36
Comment thread iocore/net/P_UnixNetVConnection.h Outdated
void cancel_active_timeout() override;
void cancel_inactivity_timeout() override;
void set_action(Continuation *c) override;
Action *get_action();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const Action *get_action() const; would be better to prevent abuse.

@shinrich shinrich force-pushed the fix-stale-hostdb-actions branch from 25ca497 to a63cecc Compare September 12, 2018 16:05
@shinrich
Copy link
Copy Markdown
Member Author

shinrich commented Sep 12, 2018

Update the commit to make the get_action() method const and return a const pointer as @maskit suggested.

@zwoop had(has?) this commit running on the trafficserver docs machine for a day or two. We've had this running internally for several weeks.

I know that one of @d2r 's commits is running into this issue. So it would be good if we could bring this fix in soon and possibly make the next 8.0 release candidate.

@d2r
Copy link
Copy Markdown
Contributor

d2r commented Sep 12, 2018

I know that one of @d2r 's commits is running into this issue. So it would be good if we could bring this fix in soon and possibly make the next 8.0 release candidate.

I am testing with this patch now to see if a new assert in #4145 is still triggered.

@d2r
Copy link
Copy Markdown
Contributor

d2r commented Sep 12, 2018

I am testing with this patch now to see if a new assert in #4145 is still triggered.

Testing with this patch still produced errors. We found that the errors was caused with a failed hostdb lookup, and so it is not directly related to this PR. So I dropped the dependency on this PR.

@shinrich
Copy link
Copy Markdown
Member Author

While this wasn't an issue for @d2r's issue. This does fix a real issue. Can someone please review? I think it passed muster with @zwoop and @maskit

@shinrich shinrich force-pushed the fix-stale-hostdb-actions branch from a63cecc to dca3226 Compare September 19, 2018 16:21
@shinrich
Copy link
Copy Markdown
Member Author

Actually just pushed up a new version to expand the assert in HttpSM::state_http_server_open to include the NET_EVENT_OPEN_FAILED

@shinrich shinrich merged commit 493803c into apache:master Sep 20, 2018
@bryancall
Copy link
Copy Markdown
Contributor

Another PR was created to pull this in to the 8.1.0 release #5235

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants