Fix assert due to unheld nh->mutex #5950
Conversation
Fixes issue #5943
Couldn't we just change the MUTEX_TRY_LOCK to a SCOPED_MUTEX_LOCK in UnixNetVConnection::add_to_active_queue()? I don't like that we are now double locking and using dynamic casting.
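For reference, a minimal sketch of what that suggestion would look like (assuming the usual ATS locking macros from the eventsystem headers; the queue update itself is elided):

```cpp
// Hedged sketch of the suggested one-line change, not an actual patch:
// a scoped lock blocks until nh->mutex is acquired and holds it for
// the rest of the scope, so a transient holder merely delays us
// instead of failing a try-lock.
void
add_to_active_queue_suggested(NetHandler *nh)
{
  SCOPED_MUTEX_LOCK(lock, nh->mutex, this_ethread());
  // ... enqueue the vconnection onto nh's active queue ...
}
```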
Thinking about it more while eating a snack, why are we updating the NetHandler from two different threads? That seems like a bigger issue.
Yes, updating from a different thread is concerning; hence the addition of the ink_release_assert in the first place. Presumably another continuation/action picked up nh->mutex and got scheduled on another thread. In this particular crash, the lock had already been dropped by the time the core dumped, so I don't know what was behaving badly.
bryancall left a comment:
I am -1 on this change. It doesn't fix the issue, and it has similar behavior to the existing try lock with more complexity.
In our initial 9.0.x testing, we got a number of cores with the following stack.
The immediate failure is that the MUTEX_TRY_LOCK on nh->mutex failed in UnixNetVConnection::add_to_active_queue. Earlier, we had issues with other threads trying to add to the active/keep-alive queues, eventually corrupting those queues.
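Roughly the failing pattern, paraphrased (not the literal source; assumes the usual ATS locking macros):

```cpp
// The try-lock fails whenever any other thread holds nh->mutex, even
// momentarily, and the release assert then takes the process down.
void
add_to_active_queue_current(NetHandler *nh)
{
  MUTEX_TRY_LOCK(lock, nh->mutex, this_ethread());
  ink_release_assert(lock.is_locked()); // fires on any transient contention
  // ... enqueue the vconnection onto nh's active queue ...
}
```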
In both of the cores I looked at, the UnixNetVConnection mutex and the EThread mutex are the same. At the time of the core, the mutex was not held, but presumably it had been moments before. I assume another thread was making a very transient grab for nh->mutex.
In this path, the event HTTP2_SESSION_EVENT_REENABLE is sent every 128 frames, presumably to keep a client sending very large data from dominating the thread. This is new logic compared to our 7.1.x build.
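A paraphrased sketch of that signaling pattern (the counter name and delivery details are illustrative, not the actual Http2 code; assumes the ATS event system headers):

```cpp
// After a batch of frames, poke the session through the event system
// instead of processing inline. Note that the handler for
// HTTP2_SESSION_EVENT_REENABLE runs under the session's mutex but
// never takes nh->mutex.
void
maybe_signal_reenable(Continuation *session, int &frames_since_reenable)
{
  if (++frames_since_reenable >= 128) {
    frames_since_reenable = 0;
    this_ethread()->schedule_imm(session, HTTP2_SESSION_EVENT_REENABLE);
  }
}
```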
This thread-to-thread signaling does not grab nh->mutex the way a more standard network-event-driven path would.
This PR checks to make sure that the request is being made from the correct thread and, in that case, performs a blocking lock.
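A hedged sketch of that approach (the thread check, names, and wrong-thread handling here are illustrative, not the literal patch):

```cpp
// Only when the caller is on the owning thread is it safe to block
// waiting for nh->mutex; a transient grab by another thread then just
// delays us instead of tripping the release assert.
void
add_to_active_queue_checked(NetHandler *nh, UnixNetVConnection *vc)
{
  // Illustrative: the vconnection's home thread stands in for "the
  // correct thread" here.
  if (this_ethread() == vc->thread) {
    SCOPED_MUTEX_LOCK(lock, nh->mutex, this_ethread());
    // ... enqueue the vconnection onto nh's active queue ...
  }
  // (handling for the wrong-thread case elided; see the patch itself)
}
```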
Diving back into my bug archives, I see that one of our users complained about a similar stack from Http1 land on our 7.x build, so placing the fix in ProxySession should address both issues.