Since the upstart service tracks the first process that it starts (i.e. the `bin/agent/agent` shell script) and not the actual agent runtime, it can orphan an agent process and then start a new one.
This typically happens like this:
- upstart tries to stop the agent (SIGTERM)
- the agent takes longer to stop than the default 5 seconds upstart waits
- upstart SIGKILLs the agent (so it actually kills the shell script process, orphaning the agent runtime)
- upstart cleans up the pid and sock files, so the next agent service start launches an agent as if no other agent were running.
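The pid mismatch behind this is easy to see in isolation: `exec` replaces the shell in place (same pid), whereas a forked child gets a new pid that the supervisor never sees. A minimal sketch, not the actual agent script:

```shell
#!/bin/sh
# A wrapper that exec's its payload keeps the same pid, so a supervisor
# tracking the wrapper's pid would also be tracking the payload itself.
out=$(sh -c 'echo $$; exec /bin/sh -c "echo \$\$"')
set -- $out
echo "wrapper pid: $1, payload pid after exec: $2"
[ "$1" = "$2" ] && echo "exec kept the pid"
```

If `bin/agent/agent` instead backgrounds the runtime (or forks it in any way), upstart's signals only ever reach the wrapper, which matches the orphaning observed here.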
On the machine where this took place, no shared resource (port, etc.) was still held by the orphaned process except for the Go expvar port, which explains why the new agent process could start without issues. This could be a hint as to what was preventing the orphaned process from exiting cleanly...
Needs further investigation
- why the agent runtime process takes a long time to stop (or doesn't stop at all)
Potential solutions
- Make upstart SIGKILL the child processes too when they don't stop in time (not sure it's possible)
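One related mitigation, sketched as a hypothetical upstart job (the paths and the timeout value are assumptions, only the stanza names come from upstart's config syntax): lengthen the SIGKILL grace period with `kill timeout`, and have upstart run the agent via `exec` so that, if the wrapper script itself exec's the runtime, the tracked pid is the agent's own.

```
# /etc/init/agent.conf (hypothetical)
description "agent"

# Give the agent more than the default 5 s to shut down before SIGKILL.
kill timeout 30

respawn

# Combined with a wrapper that exec's the runtime, upstart's SIGTERM and
# SIGKILL land on the agent runtime rather than on an intermediate shell.
exec /opt/agent/bin/agent/agent
```

This doesn't make upstart kill an entire process tree, so it is a partial answer at best; it only reduces the window in which the wrapper is killed while the runtime is still shutting down.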