Fix lost pending_actions causing actions on stale objects. #4166
shinrich merged 1 commit into apache:master
// then the value of kill_this_async_done has changed so
// we must check it again
if (kill_this_async_done == true) {
  if (pending_action) {
makes sense according to the backtraces.
might remove line 6854 as well since it's useless now.
zwoop left a comment
I think zizhong's suggestion should be done, leaving that assert would be confusing.
413d43e to 25ca497
Good point @zizhong. Updated to remove the useless assert.
a1ba929 to 25ca497
void cancel_active_timeout() override;
void cancel_inactivity_timeout() override;
void set_action(Continuation *c) override;
Action *get_action();
const Action *get_action() const; would be better to prevent abuse.
25ca497 to a63cecc
Updated the commit to make the get_action() method const and return a const pointer, as @maskit suggested. @zwoop had (has?) this commit running on the trafficserver docs machine for a day or two, and we've had it running internally for several weeks. I know that one of @d2r's commits is running into this issue, so it would be good to bring this fix in soon and possibly make the next 8.0 release candidate.
Testing with this patch still produced errors. We found that the errors were caused by a failed hostdb lookup, so they are not directly related to this PR. I have therefore dropped the dependency on this PR.
a63cecc to dca3226
Actually, I just pushed up a new version that expands the assert in HttpSM::state_http_server_open to include the NET_EVENT_OPEN_FAILED event.
dca3226 to 42cd69a
Another PR, #5235, was created to pull this into the 8.1.0 release.
We have been discovering and fixing places where hostdb was invoking handlers against HttpSM without a lock.
As we have been running this fix in production, our set of crashes has changed. Specifically, we were seeing HOSTDB_LOOKUP events showing up in HttpSM::state_http_server_open and causing an unexpected event assert. I think these changes were pushing crashes down the pipe. Looking more closely at the cores, the HOSTDB_LOOKUP event was being invoked against a stale HttpSM. Looking further down the stack, the hostname that DNS was using was different from the server hostname referenced in HttpSM. So it seems that there was a pending hostdb action that wasn't canceled before the origin HttpSM object was destroyed. The other exciting observation was that the cores seemed to happen in groups across machines, implying that some odd off-box action was affecting the ATS flow.
Here is an example of that stack.
I created another build that added ink_release_asserts on pending_action in HttpSM::state_http_server_open. The assertion is that pending_action should be nullptr before it is reassigned. If you reassign a pending_action that is not null, you lose the reference to that action and won't be able to cancel it when HttpSM shuts down. From that build, we were able to get a core with the original HttpSM. Here is an example core.
In this case, the last bit of history on the HttpSM is interesting.
The first line says that we called state_http_server_open with a NET_OPEN event. Basically ATS has sent the SYN to the origin.
The second line says that state_http_server_open was called with an inactivity timeout error.
The third line is setting the default handler to HttpSM::state_mark_os_down.
But the fourth line is back at state_http_server_open with an EVENT_ERROR, which is odd because we have already done the error handling for that socket.
Looking into the code, the error case will try additional addresses for the origin. This may just pick up the next address in the cached record, but if we are unlucky the cached DNS info may have expired in the 30-40 seconds since we started. If we go through the error case again, we may set up two DNS requests for the same state machine. It is quite possible that the state machine will be long gone by the time the second DNS request goes off. If we have overwritten the pending_action at the start of HttpSM::state_http_server_open, there is no way to cancel the event before the state machine goes away.
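The lost-reference hazard described above can be illustrated with a small standalone C++ model. This is only a sketch and not ATS code: the Action and StateMachine types and every function name below are hypothetical stand-ins, and the "cancel before replacing" variant is shown for contrast only; it is not the change made in this PR.

```cpp
#include <cassert>

// Hypothetical stand-in for a cancellable handle to pending asynchronous work.
struct Action {
  bool cancelled = false;
  void cancel() { cancelled = true; }
};

// Hypothetical stand-in for a state machine that tracks a single pending action.
struct StateMachine {
  Action *pending_action = nullptr;

  // Overwrites the stored pointer. If an action was already pending,
  // the only reference to it is lost and it can no longer be cancelled.
  void set_pending_unchecked(Action *a) { pending_action = a; }

  // Cancels any action that is still pending before storing the new one.
  void set_pending_checked(Action *a) {
    if (pending_action != nullptr) {
      pending_action->cancel();
    }
    pending_action = a;
  }

  // On shutdown, only the action we still hold a pointer to can be cancelled.
  void shutdown() {
    if (pending_action != nullptr) {
      pending_action->cancel();
      pending_action = nullptr;
    }
  }
};

// Schedules two actions on one state machine, shuts it down, and returns
// how many actions are still live (not cancelled) afterwards.
int live_after_shutdown(bool checked) {
  Action first;
  Action second;
  StateMachine sm;
  if (checked) {
    sm.set_pending_checked(&first);
    sm.set_pending_checked(&second);
  } else {
    sm.set_pending_unchecked(&first);
    sm.set_pending_unchecked(&second);
  }
  sm.shutdown();
  return (first.cancelled ? 0 : 1) + (second.cancelled ? 0 : 1);
}
```

In this model, the unchecked path leaves one action live after shutdown. That live action corresponds to the lookup that later fires against a state machine that no longer exists.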
To address this issue and avoid the double error event, the right thing to do is call release_server_session() in the error case of HttpSM::state_http_server_open. If the session has read no response, that function will call do_io_close on the server vc (freeing the backing memory) and remove the related server_entry structure and epoll entries.
We have been running this code in one colo since Thursday night and have had no crashes, whereas before we were getting crashes very regularly during high-traffic periods (presumably because of overloaded origins).