Skip to content

Change kill and stop implementation to enhance compatibility#8163

Merged
lcastellano merged 1 commit intovmware:masterfrom
lcastellano:kill
Aug 1, 2018
Merged

Change kill and stop implementation to enhance compatibility#8163
lcastellano merged 1 commit intovmware:masterfrom
lcastellano:kill

Conversation

@lcastellano
Copy link
Contributor

@lcastellano lcastellano commented Jul 27, 2018

This PR implements a set of changes that makes VIC behave like the Docker server when dealing with "kill" and "stop" commands. When a "kill" command is sent to a container, only the top process receives the signal. When a "stop" command is sent to a container the Stop Signal is sent to the top process, after 10 seconds a SIGKILL signal is sent to all the member of the process group.

This PR contains the following set of changes:

  1. Send signals only to top process (instead of to the process group) when sending kill (fixes issue 'docker kill' creates different behavior on VIC 1.4.0 and 1.2.1 #8152)
  2. For stop send the Stop Signal (default SIGTERM) to the top process, followed by a SIGKILL to the members of the process group (if all processes have not exited within 10 seconds)
  3. Fix issue where the Tether would block while trying to send the SIGKILL and leave the VM in an incorrect state, causing the following PowerOff command to fail on ESXi (issue https://github.com/vmware/vic-tasks/issues/73)

[specific ci=1-14-Docker-Kill]

@lcastellano lcastellano requested a review from a team as a code owner July 27, 2018 22:55
@lcastellano lcastellano requested review from hickeng and zjs July 27, 2018 23:04
pid = session.Cmd.Process.Pid
log.Infof("sending signal %s (%d) to process %d for %s", sig.Signal, num, pid, session.ID)
} else {
pid = -session.Cmd.Process.Pid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the negative sign here, but not above? (It seems like there's some significance to the sign of the pid, but I'm not familiar enough with this code to know what it means.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sending a signal to a positive pid sends the signal to one process only; the one identified by the pid. Sending a signal to a negative pid sends the signal to all the processes belonging to the process group identified by the positive value of the pid.

stop := []string{cs.StopSignal, string(ssh.SIGKILL)}
if stop[0] == "" {
stop[0] = string(ssh.SIGTERM)
stopActions := []string{"kill", "groupKill"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be defined as constants somewhere? I see they're used in a switch statement below.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There really should be. There is the lib/tether/shared package specifically for constants that should be shared between the tether and other subsystems.

I had to read this twice and that's knowing what the intent is. For clarity can we define the action/signal grouping together explicitly. Whether as a map[string][ssh.Action] or as:

stopSignal := cs.StopSignal
if stopSignal == "" {
	stopSignal = string(ssh.SIGTERM)
}

// action/signal pairs
stopActions := []string{"kill", "groupKill"}
stopSignals := []string{stopSignal, string(ssh.SIGKILL)}

var killed bool
...

stop := []string{cs.StopSignal, string(ssh.SIGKILL)}
if stop[0] == "" {
stop[0] = string(ssh.SIGTERM)
stopActions := []string{"kill", "groupKill"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There really should be. There is the lib/tether/shared package specifically for constants that should be shared between the tether and other subsystems.

I had to read this twice and that's knowing what the intent is. For clarity can we define the action/signal grouping together explicitly. Whether as a map[string][ssh.Action] or as:

stopSignal := cs.StopSignal
if stopSignal == "" {
	stopSignal = string(ssh.SIGTERM)
}

// action/signal pairs
stopActions := []string{"kill", "groupKill"}
stopSignals := []string{stopSignal, string(ssh.SIGKILL)}

var killed bool
...

session.Unlock()

// Don't hold the lock while waiting for the
// file descriptors to close
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// Don't hold the lock while waiting for the file descriptors
// to close as these can be held open by child processes

log.Infof("sending signal %s (%d) to process %d for %s", sig.Signal, num, pid, session.ID)
} else {
pid = -session.Cmd.Process.Pid
log.Infof("sending signal %s (%d) to process group %d for %s", sig.Signal, num, -pid, session.ID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the -pid intended here - it's a double negative but maybe that's what you were after explicitly for printing the pgid?

${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} kill -s HUP ${id}
Should Be Equal As Integers ${rc} 0
Wait Until Keyword Succeeds 20x 200 milliseconds Assert Container Output ${id} KillSignalHUP
Wait Until Keyword Succeeds 20x 200 milliseconds Assert Not In Container Output ${id} KillSignalHUP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no point in Wait Until ... here - that was present previously to wait for expected output. With the inversion of the test this will either immediately succeed or always fail.

That said I'm not sure this test should have it inverted.... I thought we were still wanting to deliver the signal to the main process, which should be the shell running the trap command in this case? Was this supposed to be changed in the test Confirm signal is not delivered to entire process group?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of this test is to make sure that the child did not receive the signal. We could assert that the parent gets the signal, but there is an existing test (that does not use the nested shell) just above this one that does exactly that. I am not sure why we shouldn't wait for the output to drain; that is what makes sure that the child process did not receive the signal and processed it by writing the message to the log. This test is meant to catch the error that caused issue: #8152.

@lcastellano lcastellano force-pushed the kill branch 4 times, most recently from 1219bad to d556822 Compare July 31, 2018 21:29
stopSignal = string(ssh.SIGTERM)
}

type actionSignal struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure the type declaration is needed if using an anonymous declaration as well.
Alternatively use the type name when declaring the array.

@@ -96,9 +102,9 @@ Confirm signal delivered to entire process group
Should Be Equal As Integers ${rc} 1
${rc} ${output}= Run And Return Rc And Output docker %{VCH-PARAMS} kill -s HUP ${id}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to wait for the HUP signal to be confirmed before we call stop:

Wait Until Keyword Succeeds  20x  200 milliseconds  Assert In Container Output  ${id}  KillSignalHUP

THEN we stop it and assert that the child didn't print output.

@lcastellano lcastellano force-pushed the kill branch 2 times, most recently from e68c0ff to dc95575 Compare August 1, 2018 00:44
Copy link
Member

@zjs zjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I entirely understand this change, but the tests seem to show that it works as expected. Thank you for the in-person explanation! This is to make the way we handle docker stop and docker killconsistent with the way the Docker daemon handles those commands.

@lcastellano lcastellano merged commit ef7abe0 into vmware:master Aug 1, 2018
zjs pushed a commit to zjs/vic that referenced this pull request Aug 2, 2018
zjs added a commit that referenced this pull request Aug 7, 2018
@lcastellano lcastellano deleted the kill branch August 21, 2018 23:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants