HTTP/2 Session & ConnectionState Cleanup#7845
HTTP/2 Session & ConnectionState Cleanup#7845masaori335 wants to merge 4 commits intoapache:masterfrom
Conversation
|
I suggest making these changes below separately beforehand to make reading the diff easy a little:
|
| connection_state.send_goaway_frame(connection_state.get_latest_stream_id_in(), Http2ErrorCode::HTTP2_ERROR_INTERNAL_ERROR); | ||
|
|
||
| case HTTP2_SESSION_EVENT_XMIT: | ||
| // Disable read - read is not needed by state_goaway | ||
| read_vio->done(); | ||
|
|
||
| // Set nbyte to issue WRITE_COMPLETE event | ||
| int64_t len = _write_buffer_reader->read_avail(); | ||
| write_vio->nbytes = write_vio->ndone + len; | ||
|
|
||
| // signal timeout event to txns | ||
| connection_state.cleanup_streams(event); |
There was a problem hiding this comment.
It may be too early to start removing dup code, but I'd have this process in _switch_to_goaway_state. It'd also make debugging easy since we'd be able to set a breakpoint for the state change.
ddadb2e to
048b158
Compare
8189c20 to
414b884
Compare
414b884 to
27e5061
Compare
|
@masaori335 What is the state of this PR? Should it be reviewed? |
27e5061 to
f3030b6
Compare
|
It's ready to review. This passed AuTest and some additional tests in our lab. I was planning to test this patch on production, but I'm not sure when I can do it. Because we haven't started testing 9.1.x. |
264a4f0 to
7ed597d
Compare
|
I pulled in this change and the other H2 cleanups into our 9.1 branch and tried it in production. It crashed pretty quickly with the following stack. Looks like we are having problems on shutdown. The HttpSM (the cont in the read_vio) has already been deleted and its mutex has been cleared. The Http2Stream _history is It looks like the HttpSM history is also still around. The end of that is It seems that the Http2Stream had already delivered an EOS which caused the HttpSM to clean itself up. Will continue to poke around. |
|
Got another core that looks the same before I could get the traffic off. |
| } else if (read_vio.cont && read_vio.op == VIO::READ) { | ||
| SCOPED_MUTEX_LOCK(lock, read_vio.cont->mutex, this_ethread()); | ||
| read_vio.cont->handleEvent(event, &read_vio); | ||
| } |
There was a problem hiding this comment.
This second clause is causing our use after free. Like 209 sends the EOS to our HttpTunnel which then causes everything to cleanup. Then in line 215 we try to access the HttpSM via the write_vio. Don't unnderstand why we need to do this two times.
case VC_EVENT_ERROR:
case VC_EVENT_EOS:
if (e && e->cookie == &read_vio) {
SCOPED_MUTEX_LOCK(lock, read_vio.mutex, this_ethread());
read_vio.cont->handleEvent(event, &read_vio);
} else if (e && e->cookie == &write_vio) {
SCOPED_MUTEX_LOCK(lock, write_vio.mutex, this_ethread());
write_vio.cont->handleEvent(event, &write_vio);
}
// This assuming stream on inbound side
// TODO: make appropriate state
if (recv_end_stream && write_vio.cont && write_vio.op == VIO::WRITE && write_vio.nbytes != 0) {
SCOPED_MUTEX_LOCK(lock, write_vio.cont->mutex, this_ethread());
write_vio.cont->handleEvent(event, &write_vio);
} else if (read_vio.cont && read_vio.op == VIO::READ) {
SCOPED_MUTEX_LOCK(lock, read_vio.cont->mutex, this_ethread());
read_vio.cont->handleEvent(event, &read_vio);
}
There was a problem hiding this comment.
Removing the second clause got rid of the quick use after free, but eventually a new core occurred.
There was a problem hiding this comment.
Thanks for testing! It looks like I haven't tested cases in which Http2Stream schedules events to itself.
Don't unnderstand why we need to do this two times.
Right, we need only one time. We need a guard of null check of e for the second block.
There was a problem hiding this comment.
f57bde4 covers this (and probably the last crash on HttpSM::do_http_server_open).
|
After fixing the use-after-free, we get this crash after a few minutes. Still running the ASAN build, so no other obvious memory reuse problems before the crash. At first glance the stack doesn't seem to have anything to do with http2, but the ua_txn is an Http2Stream and ua_txn->_proxy_ssn->_vc is null. So the assert that is trying to catch responses showing up on the wrong thread triggers. Stream _history Session _history. Our stream had _id 1 and _state HTTP2_STREAM_STATE_HALF_CLOSED_REMOTE, presumably because it had sent a header with no body. HttpSM history |
| // No more read/write op | ||
| _vc->do_io_close(); | ||
| _vc = nullptr; | ||
|
|
There was a problem hiding this comment.
I presume this is the immediate cause of the state_cache_open_write crash. In the core dying_event is set to 104. But presumably we did not get the HttpSM to stop directly as a result of the cleanup_streams() and the cache_write_open event snuck in before we could get everything shutdown.
In the current code the _vc->do_io_close() is not called until everything is shutdown and we are getting ready to free the Http2ClientSession object.
There was a problem hiding this comment.
It looks like this is a race of callback from AIO and closing vc. Actually, I don't understand why HttpSM::state_cache_open_write needs the client vc. IMO, we can write to the cache without client vc.
In the current code the _vc->do_io_close() is not called until everything is shutdown and we are getting ready to free the Http2ClientSession object.
But this is what we have now. I'll adjust this PR to follow this assumption. However, in the long term, we might want to change the design - e.g. when a client is aborted & background-fill is working, we wouldn't want to keep resources for the aborted client.
There was a problem hiding this comment.
d53ac5c get rid of this and ignore VC events on state_aborted for just in case.
|
One last crash before I shutdown tonight. I moved the _vc->do_io_close back to the Http2ClientSession::free and traffic_server ran for many minutes. But eventually it crashed with the stack below. It looks like the server session timed out and signaled the HttpSM. But the ua_txn referenced a Http2Stream that had already at least partially cleaned up. Eventually causing a segfault. I didn't see any likely shutdown signal from the client in the HttpSM history. |
511ed48 to
6d13f95
Compare
6d13f95 to
b516eca
Compare
|
I started a trial of this PR based on 9.1.0 (with various other cleanup PRs without ASan). I observed few metric changes below, but it has been running for a day without crashes, so far. 1). Inactive timeout cases are counted by |
|
You are going to fix the two metric issues above, right? At minimum closing connections with GOAWAY frame is not necessarily due to an error. I hope #8281 didn't mess up your changes too much. |
|
Yup, I'm going to fix them. At least, I'll rebase on the latest branch and deal with the #8281 change. |
|
[approve ci] |
|
This pull request has been automatically marked as stale because it has not had recent activity. Marking it stale to flag it for further consideration by the community. |
Part of address #6877. One of things makes HTTP/2 Session complicated is
Http2ClientSession&Http2ConnectionStateare calling each other asContinuation. To make this simple, I propose below.1). Make
Http2ConnectionStatejust a class2). Introduce states of HTTP/2 session as Continuation Handler
https://gist.github.com/masaori335/f16a851b50bec2f69e4034844a40ab59#file-http2ssn-uml