Skip to content

main: use custom main common to disable listening for termination signals#835

Merged
junr03 merged 5 commits into
masterfrom
disable-termination-signal-handling
May 5, 2020
Merged

main: use custom main common to disable listening for termination signals#835
junr03 merged 5 commits into
masterfrom
disable-termination-signal-handling

Conversation

@junr03
Copy link
Copy Markdown
Member

@junr03 junr03 commented May 4, 2020

Description: Disabling signal handling in the Server::Options makes it so that the server's event dispatcher does not listen for termination signals such as SIGTERM, SIGINT, etc. Previous crashes in iOS were experienced due to out-of-band event loop exit as described in #831. Ignoring termination signals makes it more likely that the event loop will only exit due to Engine destruction.

This PR introduces Envoy::MobileMainCommon, as this is the canonical way to customize how main runs, and options setup per envoyproxy/envoy#3424. The new Envoy::MobileMainCommon also does away with other logic in Envoy::MainCommon that does not apply to Envoy Mobile.

Risk Level: med - low-level change in termination handling
Testing: unit test to assert option is correctly set. End to end test with iOS and Android devices to ensure clean exit when the app using envoy mobile exits (and thus destructs the engine). Moreover, there is no event loop exit any longer when the simulator app receives a SIGTERM, i.e., the event dispatcher is no longer listening to SIGTERM and exiting due to them.

Potentially fixes #831. Will need to verify with wider client release.

Signed-off-by: Jose Nino jnino@lyft.com

Jose Nino added 2 commits May 4, 2020 15:56
…nals

Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented May 4, 2020

@mattklein123, thanks for the ideas and pointers in #831. Based on my research online, there is a good chance that making it so that the event loop does not listen for termination signals will prevent exits outside of the Engine destructor. @Reflejo if you have some time, I'd appreciate your eyes here. Your expertise would definitely help verify my research.

mattklein123
mattklein123 previously approved these changes May 4, 2020
Copy link
Copy Markdown
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@Reflejo
Copy link
Copy Markdown
Contributor

Reflejo commented May 5, 2020

This looks good for the attempt you are trying to make. I'm still not convinced this is due to a signal by reading the other issue but if it is:

Signal handling is hard. I think it's entirely possible for that null pointer bug to be a red herring (ie. the app crashed for a different reason, sent a SIGTERM / SIGINT so things get shut down twice and the initial crash is lost). Bugsnag is trying to catch signals as well and the ordering is unpredictable.

I don't think there are many other cases where iOS will send anything other than SIGKILL but maybe things changed on recent OSes.

Can we send ENVOY_LOG()s to bugsnag as breadcrumbs? I think that'd be helpful. Or at least have a way to consume logs on crashes.

Signed-off-by: Jose Nino <jnino@lyft.com>
Jose Nino added 2 commits May 5, 2020 11:28
Signed-off-by: Jose Nino <jnino@lyft.com>
Signed-off-by: Jose Nino <jnino@lyft.com>
@junr03
Copy link
Copy Markdown
Member Author

junr03 commented May 5, 2020

I think it's entirely possible for that null pointer bug to be a red herring (ie. the app crashed for a different reason, sent a SIGTERM / SIGINT.

I see, potentially. The scenario you described reinforces my wanting to make this change anyways, because the envoy event dispatcher should not be listening to termination signals, and thus leading to envoy's thread termination. The way we organized shutdown the only way to maintain predictable behavior is to have the event loop exit from the exit() call issued in the Engine's destructor. For that reason, this change should still be done to prevent "out of band" event loop exits.

Can we send ENVOY_LOG()s to bugsnag as breadcrumbs? I think that'd be helpful. Or at least have a way to consume logs on crashes.

Yeah, I have thought about that before (doc), but we have not gotten around to prioritize that work.

@junr03 junr03 merged commit 1ba6751 into master May 5, 2020
@junr03 junr03 deleted the disable-termination-signal-handling branch May 5, 2020 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

crash: trying to use garbage/null pointer

4 participants