-
Notifications
You must be signed in to change notification settings - Fork 5.3k
fix deadlock in Quic #56600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix deadlock in Quic #56600
Conversation
|
Tagging subscribers to this area: @dotnet/ncl Issue DetailsAny MsQuic API can block. We should never call it under lock. Tl;dr I investigated this on locally failing HTTP/3 test run than the 1888367 Thread: stuck waiting for and then finally So HttpClient wanted to create new stream, that calls SendAsync and that tries to acquire stream: runtime/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3Connection.cs Lines 167 to 184 in 09c146c
That calls WaitForAvailableBidirectionalStreamsAsync on MsQuicConnection, grabs lock on runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicConnection.cs Lines 485 to 511 in 38ae818
The lock in Since the whole code in Http3Connection is not atomic I feel there is no reason to that stream count under lock. Now there is really no reason why I know we talk about rewriting this in completely different way. To limit risk I simply replaced lock + increment with It would be great if this can be reviewed at high priority @stephentoub and @geoffkizer
|
|
This is great @wfurt -- this might allow us to enable most of our disabled tests! |
src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicConnection.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicConnection.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicConnection.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicConnection.cs
Outdated
Show resolved
Hide resolved
|
I would suggest that we add asserts at every place we call into msquic to ensure we are not holding the lock. |
Any MsQuic API can block. We should never call it under lock.
The real fix is in WaitForAvailableBidirectionalStreamsAsync() (and unused WaitForAvailableUnidirectionalStreamsAsync)
Tl;dr
I investigated this on locally failing HTTP/3 test run
Several threads like:
than the 1888367 Thread:
stuck waiting for
GetUShortParam()to finish.and then finally
So HttpClient wanted to create new stream, that calls SendAsync and that tries to acquire stream:
runtime/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3Connection.cs
Lines 167 to 184 in 09c146c
That calls WaitForAvailableBidirectionalStreamsAsync on MsQuicConnection, grabs lock on
stateand dives to MsQuic.MsQuic is handling event so it does not finish but instead of that it hands out event via
ConnectionEventand we try to lock thestateagain while adding new Stream from MSQuic thread.runtime/src/libraries/System.Net.Quic/src/System/Net/Quic/Implementations/MsQuic/MsQuicConnection.cs
Lines 485 to 511 in 38ae818
The lock in
TryAddStreamis somewhat irrelevant. We could get any other event(s) so I feel the real issue is calling MsQuic from within lock.Since the whole code in Http3Connection is not atomic I feel there is no reason to that stream count under lock.
We do out best since that can change and Http will try it gain.
Now there is really no reason why
AddStreamandRemoveStreeamneeds to lock wholeState. Its as convenient for synchronization but since neither function really changes thestateI decided to simply useInterlockedprimitives.I know we talk about rewriting this in completely different way. To limit risk I simply replaced lock + increment with
Interlockedwhile keeping the logic and structure mainly untouched.It would be great if this can be reviewed at high priority @stephentoub and @geoffkizer
I think this contributes to Http3 test flakiness and should be fixed soon.
I stills see some other failures in local runs but I'm yet to see mysterious timeout.
I'll keep running test to see if there are any negative impacts.