This repository was archived by the owner on Aug 19, 2019. It is now read-only.

Conversation

@igorpeshansky (Contributor)

This doesn't fully work, because the agent gets stuck waiting on joining threads, but a second SIGTERM usually does the trick. It does shut down the server first, though.

src/metadatad.cc Outdated
updater->stop();
}
std::cerr << "Exiting" << std::endl;
std::exit(128 + signum);
@supriyagarg (Contributor) commented on May 14, 2018:

Why is the exit code 128 + signum?


src/metadatad.cc Outdated
}

std::mutex server_wait_mutex;
server_wait_mutex.lock();
Contributor:

Is this line required given the lock guard in L84?
Alternatively, should this lock be released after server.start()?

Contributor Author:

The lock guard was supposed to deadlock this thread to prevent it from going into the destructor. However, with the suggestion from @bmoyles0117, this is no longer necessary.

src/metadatad.cc Outdated
std::cerr << "Stopping server" << std::endl;
google::cleanup_state->server->stop();
std::cerr << "Stopping updaters" << std::endl;
for (google::MetadataUpdater* updater : google::cleanup_state->updaters) {
Contributor:

We need to find some way of doing this in parallel. If the first updater gets stuck while stopping, it shouldn't prevent the other updaters from getting a fair chance to shut down.

Contributor Author:

This is a good point, but I think we can just request that implementations of these methods not block.
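
For illustration, a minimal sketch of the contract being requested here, with assumed names (MetadataUpdaterSketch and NotifyStopUpdater are illustrative, not the agent's real interface): each updater's stop hook only records the request and returns, so the loop over updaters can never hang on a stuck one.

#include <atomic>
#include <vector>

// Hypothetical updater interface: the stop hook must never block.
class MetadataUpdaterSketch {
 public:
  void NotifyStopUpdater() { stop_requested_ = true; }  // just records the request
  bool stop_requested() const { return stop_requested_; }

 private:
  std::atomic<bool> stop_requested_{false};
};

// With that contract, stopping every updater cannot hang, even if one of
// them is stuck in its polling work.
void StopAllUpdaters(const std::vector<MetadataUpdaterSketch*>& updaters) {
  for (MetadataUpdaterSketch* updater : updaters) {
    updater->NotifyStopUpdater();
  }
}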

src/metadatad.cc Outdated
return parse_result < 0 ? 0 : parse_result;
}

std::mutex server_wait_mutex;
Contributor:

What do you think about having a cleanup_state.wait() method that can manage the lock instead?

Contributor Author:

Good idea. In fact, it should encapsulate the stopping of the server/updaters as well. Done.
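
A rough sketch of how a cleanup_state wait method might be shaped (hypothetical names; it also swaps the bare server_wait_mutex from the diff for a condition variable, since a plain std::mutex can't portably be re-locked by the thread that already holds it):

#include <condition_variable>
#include <mutex>

// Hypothetical encapsulation of the "block main() until shutdown" state.
class CleanupState {
 public:
  // Called from main(); blocks until NotifyStopAll() has run.
  void Wait() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return stop_requested_; });
  }

  // Called from the shutdown path: stop the server and updaters (elided
  // here), then release Wait().
  void NotifyStopAll() {
    // ... server_->Stop(); NotifyStopUpdater() on each updater ...
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stop_requested_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stop_requested_ = false;
};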

@igorpeshansky (Contributor Author) left a comment:

PTAL.

@supriyagarg (Contributor) left a comment:

LGTM

@bmoyles0117 (Contributor) left a comment:

Let's rename this PR to focus on the fact that we're handling server shutdowns.

"Ensure that the http server shuts down on sigterm."

void KubernetesUpdater::StopUpdater() {
// TODO: How do we interrupt a watch thread?
if (config().KubernetesUseWatch()) {
#if 0
Contributor:

If we're not going to use this, let's remove it for now.

Contributor Author:

Done.

src/metadatad.cc Outdated
std::initializer_list<MetadataUpdater*> updaters, MetadataAgent* server)
: updaters_(updaters), server_(server) { server_wait_mutex_.lock(); }

void StopAll() const {
Contributor:

As we're renaming stop to notifyStop, we should rename this accordingly.

Contributor Author:

Renamed.

src/agent.h Outdated
void start();

// Stops serving.
void stop();
@bmoyles0117 (Contributor) commented on May 15, 2018:

Let's rename this to notifyStop, or something similar to ensure that it's clear that this simply signals threads that we're shutting down.

Contributor Author:

This specific one does stop listening to the socket, so I would keep this as Stop.

@igorpeshansky changed the title from "Shut down gracefully on SIGTERM." to "Ensure that the HTTP server shuts down gracefully on SIGTERM." on May 15, 2018
@igorpeshansky force-pushed the igorp-signal-handling branch from 32ec2ea to 86146ba on May 15, 2018
@igorpeshansky changed the title from "Ensure that the HTTP server shuts down gracefully on SIGTERM." to "Shut down gracefully on SIGTERM." on May 15, 2018
@igorpeshansky (Contributor Author) left a comment:

I got everything to shut down cleanly by eliminating all thread::join() calls from the shutdown path. PTAL.
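
As a rough illustration of that split, a sketch under assumed names (WorkerSketch is not the agent's code): the shutdown path only flips a flag, and the one join() lives in the destructor, which runs after main() has fallen through its wait.

#include <atomic>
#include <chrono>
#include <thread>

class WorkerSketch {
 public:
  void Start() {
    thread_ = std::thread([this] {
      while (!stopping_) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));  // stand-in for real work
      }
    });
  }

  // Shutdown path: only signals, never blocks, never joins.
  void NotifyStop() { stopping_ = true; }

  // The only join() is here, reached when the object is destroyed after
  // main() unblocks.
  ~WorkerSketch() {
    stopping_ = true;
    if (thread_.joinable()) thread_.join();
  }

 private:
  std::atomic<bool> stopping_{false};
  std::thread thread_;
};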

}
server_wait_mutex_.unlock();
// Give the notifications some time to propagate.
std::this_thread::sleep_for(time::seconds(0.1));
Contributor:

Is this guaranteed to be enough time?

Contributor Author:

Empirically, smaller delays were also sufficient, as this just needs to leave enough time for the thread to notice the timer unlock notification and exit the loop. For poller threads, even if one doesn't notice in time, nothing bad is going to happen, so I hesitate to introduce a larger wait here.
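
For context, a sketch of the kind of interruptible poll loop being described (assumed names, not the agent's actual timer code): the poller sleeps on a condition variable, so a stop notification wakes it within a scheduling delay and only a short grace period is needed.

#include <chrono>
#include <condition_variable>
#include <mutex>

class PollTimerSketch {
 public:
  // Waits up to one polling period; returns false once a stop was requested.
  bool WaitForNextTick(std::chrono::milliseconds period) {
    std::unique_lock<std::mutex> lock(mutex_);
    // wait_for returns early when NotifyStop() fires, so the poller leaves
    // its loop after a scheduling delay rather than a full polling period.
    cv_.wait_for(lock, period, [this] { return stopping_; });
    return !stopping_;
  }

  void NotifyStop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stopping_ = false;
};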

@bmoyles0117 (Contributor) commented on May 17, 2018:

It seems to me that unlocking the server_wait_mutex is essentially a no-op, so why wait at all?

Contributor Author:

Unlocking server_wait_mutex allows the destructors to start executing, which will start joining threads. That's more cleanup that can proceed in parallel.

Contributor:

SGTM

}

MetadataApiServer::~MetadataApiServer() {
Stop();
Contributor:

If we have Stop in this destructor, should we also call stop in MetadataAgent's destructor for consistency? I'm primarily concerned about the inconsistency; I'm not sure what the negative effects would be.

Contributor Author:

MetadataAgent's destructor will deallocate both the API server and the reporter, which will invoke their respective destructors.

Contributor:

I'm confused: why are we calling Stop from MetadataAgent if we're relying on the destructor? I may not be clear, but it seems confusing that stop gets propagated through multiple channels simultaneously.

https://github.com/Stackdriver/metadata-agent/pull/136/files#diff-61b93c57ea92f91ec66fdd4a280d8e8bR40

Contributor Author:

Stop() is idempotent. It's just a notification under the covers, so it's ok to call it more than once. Calling it from the destructor guarantees that the server will also shut down cleanly when the object is deleted.
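
A minimal sketch of an idempotent Stop() along those lines (hypothetical members, not MetadataApiServer's actual internals): the first call performs the notification, and any later call, for example from the destructor, is a no-op.

#include <atomic>

class ApiServerSketch {
 public:
  ~ApiServerSketch() { Stop(); }  // deleting the object also shuts it down cleanly

  void Stop() {
    bool expected = false;
    // Only the first caller performs the shutdown; later calls (signal path,
    // destructor, ...) return immediately.
    if (!stopping_.compare_exchange_strong(expected, true)) return;
    // ... notify server threads that we are shutting down ...
  }

 private:
  std::atomic<bool> stopping_{false};
};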

Contributor:

SGTM

src/metadatad.cc Outdated
std::cerr << "Caught SIGTERM; shutting down" << std::endl;
google::cleanup_state->StartShutdown();
std::cerr << "Exiting" << std::endl;
std::exit(128 + signum);
Contributor:

Exit code should be 0 if we terminate successfully. If it's anything but 0, Kubernetes will think that the pod crashed.

Contributor Author:

This will only happen when a pod is killed by a health check. Do we really want to report a successful exit in that case? I seem to recall that there was a distinction in pod restart behavior between success and failure exits...

Contributor:

I believe that we do, as this is the way we have implemented it for the logging agent (we didn't do anything to explicitly return 0; it just exits with 0 when it exits cleanly).

Contributor Author:

That's a fluentd thing. Any other process will exit with exactly this exit code (i.e., 143) on SIGTERM. That's just a Unix convention. I bet heapster would exit with 143 as well.
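
For reference, the convention in question, as a small sketch rather than the agent's exact handler: a process that terminates because of signal N is conventionally reported with exit status 128 + N, so SIGTERM (signal 15) shows up as 143.

#include <csignal>
#include <cstdlib>
#include <iostream>

extern "C" void HandleTerm(int signum) {
  std::cerr << "Exiting" << std::endl;
  std::exit(128 + signum);  // 128 + 15 = 143 for SIGTERM
}

int main() {
  std::signal(SIGTERM, HandleTerm);
  // ... run until a SIGTERM arrives ...
  return 0;
}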

Contributor:

We spoke offline; I believe we're going to go with a 0 exit status code on a healthy exit, after looking further into how it could positively impact Docker.

Contributor Author:

Done.

@igorpeshansky (Contributor Author) left a comment:

PTAL

@igorpeshansky force-pushed the igorp-signal-handling branch from 86146ba to 9d43060 on May 23, 2018
@bmoyles0117 (Contributor) left a comment:

LGTM 🛰

@igorpeshansky merged commit f93fadb into master on May 25, 2018
@igorpeshansky deleted the igorp-signal-handling branch on May 25, 2018