Conversation
src/metadatad.cc
Outdated
      updater->stop();
    }
    std::cerr << "Exiting" << std::endl;
    std::exit(128 + signum);
Why is the exit code 128 + signum?
Unix tradition: http://tldp.org/LDP/abs/html/exitcodes.html.
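For reference, a minimal sketch of that convention (illustrative, not the PR's exact handler): a process terminated by signal N conventionally reports exit status 128 + N, so SIGTERM (15) yields 143.

```cpp
#include <csignal>
#include <cstdlib>
#include <iostream>

// Illustrative only: report the conventional "killed by signal" status.
void HandleSignal(int signum) {
  std::cerr << "Exiting" << std::endl;
  std::exit(128 + signum);  // SIGTERM is 15, so this exits with 143.
}

int main() {
  std::signal(SIGTERM, HandleSignal);
  // ... run the daemon ...
}
```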
src/metadatad.cc
Outdated
    }

    std::mutex server_wait_mutex;
    server_wait_mutex.lock();
Is this line required given the lock guard in L84?
Alternatively, should this lock be released after server.start()?
The lock guard was supposed to deadlock this thread to prevent it from going into the destructor. However, with the suggestion from @bmoyles0117, this is no longer necessary.
src/metadatad.cc
Outdated
    std::cerr << "Stopping server" << std::endl;
    google::cleanup_state->server->stop();
    std::cerr << "Stopping updaters" << std::endl;
    for (google::MetadataUpdater* updater : google::cleanup_state->updaters) {
We need to find some way of doing this in parallel. If the first updater gets stuck while stopping, it shouldn't prevent the other updaters from having a fair chance of shutting down.
This is a good point, but I think we can just request that the implementation of these methods does not block.
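One possible shape of such a non-blocking stop, assuming a polling updater with a worker loop (class and member names here are illustrative, not the PR's actual code): the method only flips a flag and wakes the loop, and it never joins the thread, so a stuck updater cannot hold up the others.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

class PollingUpdater {
 public:
  // Non-blocking: record the stop request and wake the poll loop; do not join.
  void NotifyStopUpdater() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    timer_cv_.notify_all();
  }

 private:
  // Runs on the updater's worker thread.
  void PollLoop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!stopping_) {
      // ... poll the metadata source and publish updates ...
      timer_cv_.wait_for(lock, std::chrono::seconds(60));  // Or until woken.
    }
  }

  std::mutex mutex_;
  std::condition_variable timer_cv_;
  bool stopping_ = false;
};
```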
src/metadatad.cc
Outdated
      return parse_result < 0 ? 0 : parse_result;
    }

    std::mutex server_wait_mutex;
What do you think about having a cleanup_state.wait() method that can manage the lock instead?
Good idea. In fact, it should encapsulate the stopping of the server/updaters as well. Done.
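A possible shape of that encapsulation, using a condition-variable wait rather than the bare mutex (a sketch only; the class and method names mirror the review, the stand-in types stand in for the agent's real interfaces, and NotifyStopUpdater is assumed):

```cpp
#include <condition_variable>
#include <mutex>
#include <utility>
#include <vector>

// Stand-ins for the PR's classes; the real interfaces live in the agent code.
struct MetadataUpdater { virtual void NotifyStopUpdater() {} };
struct MetadataAgent { void Stop() {} };

class CleanupState {
 public:
  CleanupState(std::vector<MetadataUpdater*> updaters, MetadataAgent* server)
      : updaters_(std::move(updaters)), server_(server) {}

  // Blocks the main thread until StartShutdown() has been called.
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return shutdown_requested_; });
  }

  // Stops the server and updaters, then releases the thread blocked in Wait().
  void StartShutdown() {
    server_->Stop();
    for (MetadataUpdater* updater : updaters_) {
      updater->NotifyStopUpdater();  // Must not block (see the thread above).
    }
    {
      std::lock_guard<std::mutex> lock(mutex_);
      shutdown_requested_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::vector<MetadataUpdater*> updaters_;
  MetadataAgent* server_;
  std::mutex mutex_;
  std::condition_variable cv_;
  bool shutdown_requested_ = false;
};
```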
igorpeshansky
left a comment
PTAL.
supriyagarg
left a comment
LGTM
Let's rename this PR to focus on the fact that we're handling server shutdowns.
"Ensure that the http server shuts down on sigterm."
src/kubernetes.cc
Outdated
    void KubernetesUpdater::StopUpdater() {
      // TODO: How do we interrupt a watch thread?
      if (config().KubernetesUseWatch()) {
    #if 0
If we're not going to use this, let's remove it for now.
Done.
src/metadatad.cc
Outdated
        std::initializer_list<MetadataUpdater*> updaters, MetadataAgent* server)
        : updaters_(updaters), server_(server) { server_wait_mutex_.lock(); }

      void StopAll() const {
As we're renaming stop to notifyStop, we should rename this accordingly.
Renamed.
src/agent.h
Outdated
      void start();

      // Stops serving.
      void stop();
Let's rename this to notifyStop, or something similar to ensure that it's clear that this simply signals threads that we're shutting down.
This specific one does stop listening to the socket, so I would keep this as Stop.
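To make that distinction concrete, a sketch of the two kinds of declarations being discussed (names and casing are illustrative, not copied from src/agent.h):

```cpp
// Server side: Stop() really does stop listening on the socket,
// so the name Stop is accurate here.
class MetadataApiServer {
 public:
  void Start();
  void Stop();
};

// Updater side: the method only signals worker threads that shutdown has
// begun, hence the NotifyStop-style name; it must not block or join.
class MetadataUpdater {
 public:
  void NotifyStopUpdater();
};
```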
Force-pushed from 32ec2ea to 86146ba.
igorpeshansky
left a comment
I got everything to shut down cleanly by eliminating all thread::join() calls from the shutdown path. PTAL.
      }
      server_wait_mutex_.unlock();
      // Give the notifications some time to propagate.
      std::this_thread::sleep_for(time::seconds(0.1));
Is this guaranteed to be enough time?
Empirically, smaller delays were also sufficient, as this just needs enough time for the thread to notice the timer unlock notification and exit the loop. For poller threads, even if it doesn't, nothing bad is going to happen, so I hesitate to introduce a larger wait here.
It seems to me that unlocking the server_wait_mutex is essentially a no-op, so why wait at all?
Unlocking server_wait_mutex allows the destructors to start executing, which will start joining threads, so more cleanup can proceed in parallel.
SGTM
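To illustrate the split described in this thread and in the review comment above (a sketch with assumed names such as Reporter and NotifyStop, not the PR's exact code): the shutdown path only signals, and joining happens later in the destructor, once the main thread's wait is released and objects start being destroyed.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

class Reporter {
 public:
  Reporter() : thread_([this] { Run(); }) {}

  // Called from the shutdown path: signal only, never join here.
  void NotifyStop() { stopping_ = true; }

  // Runs when destruction begins, i.e. after the main thread's wait is
  // released; joining here keeps thread::join() out of the shutdown path.
  ~Reporter() {
    NotifyStop();
    if (thread_.joinable()) thread_.join();
  }

 private:
  void Run() {
    while (!stopping_) {
      // ... periodically report metadata ...
      std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
  }

  std::atomic<bool> stopping_{false};
  std::thread thread_;
};
```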
    }

    MetadataApiServer::~MetadataApiServer() {
      Stop();
If we have Stop in this destructor, should we also call stop in MetadataAgent's destructor for consistency? I'm primarily concerned about the inconsistency; I'm not sure what the negative effects would be.
MetadataAgent's destructor will deallocate both the API server and the reporter, which will invoke their respective destructors.
I'm confused: why are we calling Stop from MetadataAgent if we're relying on the destructor? I may not be clear, but it seems confusing that stop gets propagated through multiple channels simultaneously.
Stop() is idempotent. It's just a notification under the covers, so it's ok to call it more than once. Calling it from the destructor guarantees that the server will also shut down cleanly when the object is deleted.
SGTM
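One way Stop() can be made safely idempotent, sketched under the assumption that stopping is a one-shot notification as described above (std::call_once is used here for illustration; the PR's actual mechanism may differ):

```cpp
#include <mutex>

class MetadataApiServer {
 public:
  // Safe to call any number of times; only the first call has an effect.
  void Stop() {
    std::call_once(stop_once_, [this] {
      // ... stop listening on the socket and wake any waiting threads ...
    });
  }

  // Destruction always goes through Stop(), so deleting the object also
  // shuts the server down cleanly even if nobody called Stop() explicitly.
  ~MetadataApiServer() { Stop(); }

 private:
  std::once_flag stop_once_;
};
```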
src/metadatad.cc
Outdated
      std::cerr << "Caught SIGTERM; shutting down" << std::endl;
      google::cleanup_state->StartShutdown();
      std::cerr << "Exiting" << std::endl;
      std::exit(128 + signum);
Exit code should be 0 if we terminate successfully. If it's anything but 0, Kubernetes will think that the pod crashed.
This will only happen when a pod is killed by a health check. Do we really want to report a successful exit in that case? I seem to recall that there was a distinction in pod restart behavior between success and failure exits...
I believe that we do, as this is the way we have implemented it for the logging agent (we didn't do anything to explicitly return 0; it just does when it exits cleanly).
That's a fluentd thing. Any other process will exit with exactly this exit code (i.e., 143) on SIGTERM. That's just a Unix convention. I bet heapster would exit with 143 as well.
We spoke offline, I believe we're going to go with a 0 exit status code on a healthy exit after looking further into how it could positively impact docker.
Done.
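A sketch of the behavior agreed on above: the handler body follows the quoted diff, with the exit status changed to 0 for a healthy shutdown (the CleanupState stub here is a stand-in; only the names google::cleanup_state and StartShutdown come from the PR).

```cpp
#include <csignal>
#include <cstdlib>
#include <iostream>

namespace google {
// Stand-in for the PR's cleanup state (see the CleanupState sketch earlier).
struct CleanupState { void StartShutdown() { /* stop the server and updaters */ } };
CleanupState state_instance;
CleanupState* cleanup_state = &state_instance;
}  // namespace google

void HandleSigterm(int signum) {
  std::cerr << "Caught SIGTERM; shutting down" << std::endl;
  google::cleanup_state->StartShutdown();
  std::cerr << "Exiting" << std::endl;
  std::exit(0);  // Healthy exit: 128 + signum (143) would look like a crash to Kubernetes.
}

int main() {
  std::signal(SIGTERM, HandleSigterm);
  // ... start updaters, start the server, wait for shutdown ...
}
```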
igorpeshansky
left a comment
PTAL
…er before the updaters.
Also clarify the contract for StopUpdater.
- Function renames (mostly Stop->NotifyStop).
- Remove dead code.
Force-pushed from 86146ba to 9d43060.
bmoyles0117
left a comment
LGTM 🛰
This doesn't fully work, because the agent gets stuck waiting on joining threads, but a second SIGTERM usually does the trick. It does shut down the server first, though.