Fix lost pending_actions causing actions on stale objects. by shinrich · Pull Request #4166 · apache/trafficserver

shinrich · 2018-08-27T18:31:43Z

We have been discovering and fixing places where hostdb was invoking handlers against HttpSM without a lock.

As we have been running this fix in production, our set of crashes has changed. Specifically we were seeing HOSTDB_LOOKUP events showing up in HttpSM::state_http_server_open and causing an unexpected event assert. I think these changes were pushing crashes down the pipe. Looking more closely at the cores, the HOSTDB_LOOKUP event was being invoked against a stale HttpSM. Looking further down the stack, the hostname that DNS was using was different that the server hostname referenced in HttpSM. So it seems that there was a pending hostdb action that wasn't canceled before the origin HttpSM object was destroyed. The other exciting observation was that the cores seemed to happen in groups across machines implying there was some odd off box actions affecting the ATS flow.

Here is an example of that stack.

#0  0x00002b42cfc661d7 in raise () from /lib64/libc.so.6
#1  0x00002b42cfc678c8 in abort () from /lib64/libc.so.6
#2  0x00002b42cd183277 in ink_abort (message_format=0x2b42cd1ac080 "%s:%d: failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b42cd1803c7 in _ink_assert (expression=0x864dc7 "0", file=0x864d98 "../../../../trafficserver/proxy/http/HttpSM.cc", line=1941) at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x0000000000657bf7 in HttpSM::state_http_server_open (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:1941
#5  0x000000000065aec1 in HttpSM::main_handler (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/proxy/http/HttpSM.cc:2802
#6  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b4331c64400, event=500, data=0x13ccc00) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000075101f in reply_to_cont (cont=0x2b4331c64400, r=0x13ccc00, is_srv=false) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:491
#8  0x0000000000755886 in HostDBContinuation::dnsEvent (this=0x2b42cd0fd9d0, event=600, e=0x2b435c127440) at ../../../../trafficserver/iocore/hostdb/HostDB.cc:1470
#9  0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42cd0fd9d0, event=600, data=0x2b435c127440) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#10 0x00000000007687f6 in DNSEntry::postEvent (this=0x2b42f01bce00) at ../../../../trafficserver/iocore/dns/DNS.cc:1282
#11 0x0000000000830ff5 in Continuation::handleEvent (this=0x2b42f01bce00, event=1, data=0x2b4333faa6e0) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#12 0x000000000083312b in EThread::process_event (this=0x2b42d26bb010, e=0x2b4333faa6e0, calling_code=1) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:132
#13 0x0000000000833302 in EThread::process_queue (this=0x2b42d26bb010, NegativeQueue=0x2b42f54a3e10, ev_count=0x2b42f54a3e08, nq_count=0x2b42f54a3e0c) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:171
#14 0x00000000008335ca in EThread::execute_regular (this=0x2b42d26bb010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:231
#15 0x0000000000833949 in EThread::execute (this=0x2b42d26bb010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#16 0x000000000083257e in spawn_thread_internal (a=0x1247070) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#17 0x00002b42ceff9dc5 in start_thread () from /lib64/libpthread.so.0
#18 0x00002b42cfd2876d in clone () from /lib64/libc.so.6

I created another build that added ink_release_asserts on pending_action in HttpSM::state_http_server_open. The assertion is that pending_action should be nullptr before reassigning it. If you reassign a pending_action that is not null, you lose the reference to that action and you won't be able to cancel that action when HttpSM shuts down. From that build, we were able to get core with the original HttpSM. Here is an example core.

#0  0x00002b4c445cb1d7 in raise () from /lib64/libc.so.6
#1  0x00002b4c445cc8c8 in abort () from /lib64/libc.so.6
#2  0x00002b4c41ae8277 in ink_abort (message_format=0x2b4c41b11080 "%s:%d: failed assertion `%s`") at ../../../../trafficserver/lib/ts/ink_error.cc:99
#3  0x00002b4c41ae53c7 in _ink_assert (expression=0x865780 "event == EVENT_INTERVAL || pending_action == nullptr", file=0x864c38 "../../../../trafficserver/proxy/http/HttpSM.cc", line=1833)
    at ../../../../trafficserver/lib/ts/ink_assert.cc:37
#4  0x00000000006570c5 in HttpSM::state_http_server_open (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:1833
#5  0x000000000065ad79 in HttpSM::main_handler (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/proxy/http/HttpSM.cc:2799
#6  0x0000000000830eb9 in Continuation::handleEvent (this=0x2b4c84697c00, event=3, data=0x2b4e64f22f90) at ../../../../trafficserver/iocore/eventsystem/Continuation.cc:33
#7  0x000000000080de66 in read_signal_and_update (event=3, vc=0x2b4e64f22e40) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:144
#8  0x000000000080e1e1 in read_signal_done (event=3, nh=0x2b4c47998cf0, vc=0x2b4e64f22e40) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:205
#9  0x000000000080e299 in read_signal_error (nh=0x2b4c47998cf0, vc=0x2b4e64f22e40, lerrno=0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:229
#10 0x0000000000810eb3 in UnixNetVConnection::readSignalError (this=0x2b4e64f22e40, nh=0x2b4c47998cf0, err=0) at ../../../../trafficserver/iocore/net/UnixNetVConnection.cc:1090
#11 0x00000000007e8617 in SSLNetVConnection::net_read_io (this=0x2b4e64f22e40, nh=0x2b4c47998cf0, lthread=0x2b4c47995010) at ../../../../trafficserver/iocore/net/SSLNetVConnection.cc:541
#12 0x0000000000805f4f in NetHandler::waitForActivity (this=0x2b4c47998cf0, timeout=60000000) at ../../../../trafficserver/iocore/net/UnixNet.cc:497
#13 0x0000000000833646 in EThread::execute_regular (this=0x2b4c47995010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:273
#14 0x000000000083380d in EThread::execute (this=0x2b4c47995010) at ../../../../trafficserver/iocore/eventsystem/UnixEThread.cc:326
#15 0x0000000000832442 in spawn_thread_internal (a=0x120bee0) at ../../../../trafficserver/iocore/eventsystem/Thread.cc:85
#16 0x00002b4c4395edc5 in start_thread () from /lib64/libpthread.so.0
#17 0x00002b4c4468d76d in clone () from /lib64/libc.so.6

In this case, the last bit of history on the HttpSM is interesting.

{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 200, reentrancy = 2}, 
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 105, reentrancy = 1}, 
{fileline = 0x868a28 "../../../../trafficserver/proxy/http/HttpSM.cc:7745", event = 65535, reentrancy = 1},
{fileline = 0x865728 "../../../../trafficserver/proxy/http/HttpSM.cc:1832", event = 3, reentrancy = 1},

The first line says that we called state_http_server_open with a NET_OPEN event. Basically ATS has sent the SYN to the origin.

The second line says that state_http_server_open was called with an inactivity timeout error.

The third line is setting the default handler to HttpSM::state_mark_os_down.

But the fourth line is back at state_http_server_open with a EVENT_ERROR. Which is odd because we have already done the error stuff for that socket.

Looking into the code, the error case will try additional addresses for the origin. This may just pick up the next address in the cached record, but if we are unlucky the cached DNS info may have expired in the 30-40 seconds since we started. If we go through the error case again, we may set up two dns requests for the same state machine. It is quite possible that the state machine will be long gone by the time the second DNS request goes off. If we have overwritten the pending_action at the start of HttpSM::state_http_server_open, there is no way to cancel the event before the state machine goes away.

To address this issue and avoid the double error event, the right thing to do is call release_server_session() in the error case of HttpSM::state_http_server_open. If it session has read no response, that function will call do_io_close on the server vc (freeing the backing memory) and remove the related server_entry structure and epoll entries.

We have been running this code in one colo since Thursday night and have had no crashes. Where before we were getting crashes very regularly during high traffic time (presumably because of overloaded origins).

zizhong · 2018-08-27T22:48:07Z

  //   then the value of kill_this_async_done has changed so
  //   we must check it again
  if (kill_this_async_done == true) {
+    if (pending_action) {


makes sense according to the backtraces.
might remove line 6854 as well since it's useless now.

zwoop

I think zizhong's suggestion should be done, leaving that assert would be confusing.

shinrich · 2018-09-06T18:20:08Z

Good point @zizhong. Updated to remove the useless assert.

shinrich · 2018-09-10T15:36:45Z

@zizhong @zwoop look ok now?

maskit · 2018-09-11T03:15:52Z

  void cancel_active_timeout() override;
  void cancel_inactivity_timeout() override;
  void set_action(Continuation *c) override;
+  Action *get_action();


const Action *get_action() const; would be better to prevent abuse.

shinrich · 2018-09-12T16:07:21Z

Update the commit to make the get_action() method const and return a const pointer as @maskit suggested.

@zwoop had(has?) this commit running on the trafficserver docs machine for a day or two. We've had this running internally for several weeks.

I know that one of @d2r 's commits is running into this issue. So it would be good if we could bring this fix in soon and possibly make the next 8.0 release candidate.

d2r · 2018-09-12T16:14:17Z

I know that one of @d2r 's commits is running into this issue. So it would be good if we could bring this fix in soon and possibly make the next 8.0 release candidate.

I am testing with this patch now to see if a new assert in #4145 is still triggered.

d2r · 2018-09-12T21:05:26Z

I am testing with this patch now to see if a new assert in #4145 is still triggered.

Testing with this patch still produced errors. We found that the errors was caused with a failed hostdb lookup, and so it is not directly related to this PR. So I dropped the dependency on this PR.

shinrich · 2018-09-19T16:10:15Z

While this wasn't an issue for @d2r's issue. This does fix a real issue. Can someone please review? I think it passed muster with @zwoop and @maskit

shinrich · 2018-09-19T16:22:22Z

Actually just pushed up a new version to expand the assert in HttpSM::state_http_server_open to include the NET_EVENT_OPEN_FAILED

bryancall · 2019-04-11T17:45:26Z

Another PR was created to pull this in to the 8.1.0 release #5235

shinrich added Core Network HostDB Crash labels Aug 27, 2018

shinrich added this to the 9.0.0 milestone Aug 27, 2018

shinrich self-assigned this Aug 27, 2018

zizhong previously approved these changes Aug 27, 2018

View reviewed changes

zwoop requested changes Aug 29, 2018

View reviewed changes

shinrich dismissed zizhong’s stale review via 25ca497 September 6, 2018 18:19

shinrich force-pushed the fix-stale-hostdb-actions branch from 413d43e to 25ca497 Compare September 6, 2018 18:19

zwoop previously approved these changes Sep 10, 2018

View reviewed changes

zizhong previously approved these changes Sep 10, 2018

View reviewed changes

shinrich dismissed stale reviews from zizhong and zwoop via a1ba929 September 10, 2018 18:16

shinrich force-pushed the fix-stale-hostdb-actions branch from a1ba929 to 25ca497 Compare September 10, 2018 20:36

maskit reviewed Sep 11, 2018

View reviewed changes

shinrich force-pushed the fix-stale-hostdb-actions branch from 25ca497 to a63cecc Compare September 12, 2018 16:05

d2r mentioned this pull request Sep 12, 2018

Adds configurable behavior on redirect to loopback #4145

Merged

shinrich force-pushed the fix-stale-hostdb-actions branch from a63cecc to dca3226 Compare September 19, 2018 16:21

SolidWallOfCode previously approved these changes Sep 19, 2018

View reviewed changes

Fix lost pending_actions causing actions on stale objects.

42cd69a

shinrich dismissed SolidWallOfCode’s stale review via 42cd69a September 19, 2018 17:50

shinrich force-pushed the fix-stale-hostdb-actions branch from dca3226 to 42cd69a Compare September 19, 2018 17:50

SolidWallOfCode approved these changes Sep 19, 2018

View reviewed changes

shinrich merged commit 493803c into apache:master Sep 20, 2018

bryancall modified the milestones: 9.0.0, Backported Apr 11, 2019

oknet mentioned this pull request Apr 30, 2019

Rewrite the assert statement to avoid Socks Proxy triggering the assertions #5396

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix lost pending_actions causing actions on stale objects.#4166

Fix lost pending_actions causing actions on stale objects.#4166
shinrich merged 1 commit intoapache:masterfrom
shinrich:fix-stale-hostdb-actions

shinrich commented Aug 27, 2018 •

edited

Loading

Uh oh!

zizhong Aug 27, 2018

Uh oh!

zwoop left a comment

Uh oh!

shinrich commented Sep 6, 2018

Uh oh!

shinrich commented Sep 10, 2018

Uh oh!

maskit Sep 11, 2018

Uh oh!

shinrich commented Sep 12, 2018 •

edited

Loading

Uh oh!

d2r commented Sep 12, 2018

Uh oh!

d2r commented Sep 12, 2018

Uh oh!

shinrich commented Sep 19, 2018

Uh oh!

shinrich commented Sep 19, 2018

Uh oh!

bryancall commented Apr 11, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

Conversation

shinrich commented Aug 27, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

zizhong Aug 27, 2018

Choose a reason for hiding this comment

Uh oh!

zwoop left a comment

Choose a reason for hiding this comment

Uh oh!

shinrich commented Sep 6, 2018

Uh oh!

shinrich commented Sep 10, 2018

Uh oh!

maskit Sep 11, 2018

Choose a reason for hiding this comment

Uh oh!

shinrich commented Sep 12, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

d2r commented Sep 12, 2018

Uh oh!

d2r commented Sep 12, 2018

Uh oh!

shinrich commented Sep 19, 2018

Uh oh!

shinrich commented Sep 19, 2018

Uh oh!

bryancall commented Apr 11, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

shinrich commented Aug 27, 2018 •

edited

Loading

shinrich commented Sep 12, 2018 •

edited

Loading