Skip to content

Conversation

@abhishekshivanna
Copy link
Contributor

@abhishekshivanna abhishekshivanna commented Dec 20, 2019

Symptom:
When a container heartbeat fails, the container shutdown
sequence is triggered and the Container is never restarted.

Cause:
When a container heartbeat fails, the container shutdown
sequence exists the Container with an exit code of 0 which
marks the container as Completed - preventing the JobCoordinator
from restarting the container.
The bug is caused by containerException overwritten with the value
returned by listener.getContainerException without checking if
containerException was already set by the heartbeat monitor

Changes:
Check the container exception if its been populated before populating it with the exception from listener.

Symptom: When a container heartbeat fails, the container shutdown
         sequence is triggered and the Container is never restarted.

Cause:   When a container heartbeat fails, the container shutdown
         sequence exists the Container with an exit code of `0` which
         marks the container as `Completed` - preventing the JobCoordinator
         from restarting the container.

Changes: The container can shutdown exceptionally in the following two ways:
         1) Exception in the container
         2) Heartbeat Expired
         In both paths the ContainerLaunchUtil previously expected a
         shared static variable to hold the exception. The change introduced
         gets rid of the static variable and checks each path explicitly
         and exits with code `1` in both cases.

Tests:
@mynameborat
Copy link
Contributor

Seems like the bug is we end up overwriting the containerException with listener.getContainerException without checking if it has already been populated. Due to this, it is possible that container listener doesn't have any exception because it was requested to shut itself down but the request was due to expired heartbeat.

Is that correct?

@abhishekshivanna
Copy link
Contributor Author

abhishekshivanna commented Dec 20, 2019

@mynameborat Correct ! I updated the description to include this.

@mynameborat
Copy link
Contributor

can we add a unit test for this?

@mynameborat mynameborat reopened this Dec 20, 2019
Copy link
Contributor

@mynameborat mynameborat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the class isn't easily unit testable. Let us get this merged.
Would be nice to follow up with a unit test for the entire util.

@abhishekshivanna
Copy link
Contributor Author

I agree, looks like there is no way to mock objects that ContainerLaunchUtil instantiates in order to test this.

@mynameborat mynameborat merged commit 08ee3a3 into apache:master Dec 20, 2019
rmatharu-zz pushed a commit to rmatharu-zz/samza that referenced this pull request Jan 21, 2020
…ache#1240)

Populate container exception from the listener only if it is null
lhaiesp pushed a commit that referenced this pull request Feb 6, 2020
)

Populate container exception from the listener only if it is null
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants