SAMZA-2423: Heartbeat failure causes incorrect container shutdown #1240
Symptom: When a container heartbeat fails, the container shutdown
sequence is triggered and the Container is never restarted.
Cause: When a container heartbeat fails, the container shutdown
sequence exits the container with an exit code of `0`, which
marks the container as `Completed` and prevents the JobCoordinator
from restarting it.
Changes: The container can shut down exceptionally in the following two ways:
1) Exception in the container
2) Heartbeat Expired
In both paths, ContainerLaunchUtil previously expected a
shared static variable to hold the exception. This change
removes the static variable, checks each path explicitly,
and exits with code `1` in both cases.
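The exit-code decision described above can be sketched roughly as follows. This is a minimal illustration and not the actual ContainerLaunchUtil code: the class `ExitCodeSketch`, the method `exitCodeFor`, and its parameters are hypothetical names introduced here.

```java
// Hypothetical sketch: names below are illustrative, not taken from Samza.
final class ExitCodeSketch {

    // Maps the two exceptional shutdown paths to a non-zero exit code.
    // containerFailure: exception raised inside the container, or null if none.
    // heartbeatExpired: true if the heartbeat monitor reported an expiry.
    static int exitCodeFor(Throwable containerFailure, boolean heartbeatExpired) {
        if (containerFailure != null) {
            return 1; // path 1: exception in the container
        }
        if (heartbeatExpired) {
            return 1; // path 2: heartbeat expired
        }
        return 0; // clean shutdown, which gets the container marked Completed
    }

    public static void main(String[] args) {
        System.out.println(exitCodeFor(null, false));
        System.out.println(exitCodeFor(new RuntimeException("boom"), false));
        System.out.println(exitCodeFor(null, true));
    }
}
```

Checking each path from its own input, rather than from one shared static field, means neither path can silently clear the failure recorded by the other.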
Tests:
Seems like the bug is we end up overwriting the Is that correct?
@mynameborat Correct! I updated the description to include this.
can we add a unit test for this?
mynameborat
left a comment
Looks like the class isn't easily unit testable. Let us get this merged.
Would be nice to follow up with a unit test for the entire util.
I agree, looks like there is no way to mock objects that
…ache#1240) Populate container exception from the listener only if it is null
Symptom:
When a container heartbeat fails, the container shutdown
sequence is triggered and the Container is never restarted.
Cause:
When a container heartbeat fails, the container shutdown
sequence exits the container with an exit code of `0`, which
marks the container as `Completed` and prevents the JobCoordinator
from restarting it.
The bug is caused by `containerException` being overwritten with the value
returned by `listener.getContainerException` without checking whether
`containerException` was already set by the heartbeat monitor.
Changes:
Check whether the container exception has already been populated before populating it with the exception from the listener.
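A minimal sketch of that null check, assuming the exception set by the heartbeat monitor and the one reported by the listener are both available as plain `Throwable` values. The class `ContainerExceptionGuardSketch` and the method `resolve` are hypothetical names, not part of Samza.

```java
// Hypothetical sketch: names below are illustrative, not taken from Samza.
final class ContainerExceptionGuardSketch {

    // Keeps an exception that was already recorded (for example by the
    // heartbeat monitor) and falls back to the listener's value only
    // when nothing has been recorded yet.
    static Throwable resolve(Throwable alreadySet, Throwable fromListener) {
        return alreadySet != null ? alreadySet : fromListener;
    }

    public static void main(String[] args) {
        Throwable heartbeat = new IllegalStateException("heartbeat expired");
        // The listener saw no exception, so an unguarded assignment would
        // have replaced the heartbeat failure with null.
        Throwable result = resolve(heartbeat, null);
        System.out.println(result == heartbeat);
    }
}
```

Without the guard, a listener that observed no exception would overwrite the heartbeat failure with `null`, and the container would exit with code `0`.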