"Aggressive" disconnects from the streaming tracer at shutdown time. #219

@lcapaldo

The C++ streaming recorder seems to "never" stop making requests, even if there are no spans to send. This is mostly harmless I think, and it makes sense to optimize for the case where there are spans to send rather than the case where there are none.

At shutdown of a service, though, the fact that there is always an HTTP request in flight to the satellites (whether there are spans to send or not) means the Tracer shutdown is almost never clean. It seems (based on my reading of the code) that we Flush, sending spans if there are any, and then the socket is closed before the satellite's response arrives.

It may also be possible that some of the time the response arrives and reaches the code in OnReadable, which then reconnects, and that new request is then aborted abruptly.

I appreciate we do not want to block the process indefinitely on tracer shutdown. However, the current behavior presents a problem for monitoring some middleware or proxies, as there are many legitimate reasons to want to track client disconnections. For example, nginx logs these as the non-standard HTTP status code 499. This can be helpful for identifying mismatched timeouts between client and server, services that aren't performing to the expectations of clients, or other issues between the client and server.

Since a "normal" tracer exit will close the connection before receiving a response (and again this is happening even if the tracer is quiescent because there is almost always a request in flight, even if it is empty of spans), this can create a lot of noise in the metrics and logs for this scenario (client closing the connection before the response is sent) that may mask "real" client side timeouts that need to be investigated and resolved.

Note also that this isn't just a matter of "well, if your network/middleware/etc. is fast enough the response will be sent back quickly and you won't see this". I have replicated this with a pretty simple "fake satellite" and the example stream program running on localhost: Example.

It seems like some variant of Flush that would set a flag in OnReadable to close the connection (just FreeSocket instead of Reconnect) could let folks opt into a clean exit. The graceful shutdown timer could still apply, so even for those folks it wouldn't be an indefinite shutdown; it just wouldn't be a "timeout of 0" like it seems to be today. Another option could be to not initiate new connections if there were no spans to send for some period of time. Again, this would be a tunable time period, and folks could just stop generating spans (likely to happen on the shutdown path anyway). I have not tried to implement either of these, and there may be a nuance or nuances that I am missing.
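
To make the two ideas a bit more concrete, here is a rough sketch as a toy model. It is not the recorder's actual code: `OnReadable`, `FreeSocket` and `Reconnect` are only borrowed as names from my reading of the recorder, while `ToyStreamingConnection`, `RequestCleanShutdown`, `ShouldStartNewRequest` and `idle_limit` are hypothetical names made up for illustration. The graceful shutdown timer is not modeled; it would still bound how long the clean path is allowed to take.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Toy model only: it records which path was taken instead of touching sockets.
class ToyStreamingConnection {
 public:
  using Clock = std::chrono::steady_clock;

  // Hypothetical opt-in (e.g. set by a Flush variant on the shutdown path):
  // after the next response, close the connection instead of reconnecting.
  void RequestCleanShutdown() { close_after_response_ = true; }

  // Stand-in for the point where the satellite's response has been read.
  void OnReadable() {
    events_.push_back("response");
    if (close_after_response_) {
      FreeSocket();  // clean exit: no new request is left in flight
    } else {
      Reconnect();   // today's behavior as I understand it
    }
  }

  bool connected() const { return connected_; }
  const std::vector<std::string>& events() const { return events_; }

 private:
  void FreeSocket() {
    connected_ = false;
    events_.push_back("free_socket");
  }
  void Reconnect() {
    connected_ = true;
    events_.push_back("reconnect");
  }

  bool close_after_response_ = false;
  bool connected_ = true;
  std::vector<std::string> events_;
};

// Second option: only start a new request if there are spans waiting, or if
// a span was seen within the (hypothetical, tunable) idle_limit.
inline bool ShouldStartNewRequest(
    bool has_pending_spans,
    ToyStreamingConnection::Clock::time_point last_span_time,
    ToyStreamingConnection::Clock::time_point now,
    std::chrono::milliseconds idle_limit) {
  if (has_pending_spans) return true;
  return (now - last_span_time) < idle_limit;
}
```

With this model, calling `OnReadable()` on a fresh connection records a reconnect, whereas calling `RequestCleanShutdown()` first makes the next `OnReadable()` free the socket and stay disconnected. For the idle variant, a quiescent tracer stops opening new requests once `idle_limit` has elapsed since the last span, so by the time shutdown happens there may be nothing in flight to abort.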
