Add /healthz endpoint that returns 500 when some watch stream is stale. #165
src/health_checker.cc
Outdated
| return false; | ||
| } | ||
| for (auto& c : component_callbacks_) { | ||
| if (!c.second()) { |
I'm not sure how this works (forgive my naivety in C++): wouldn't UnregisterCallback remove the key entirely, so that the c for the component would never be iterated over in the first place?
This doesn't check for the existence of the value, this calls the value (which is a std::function) and checks the result.
@bmoyles0117 Unregistering should only happen once the thread is done.
@igorpeshansky That's right.
igorpeshansky
left a comment
A few preliminary comments.
src/api_server.h
Outdated
| std::shared_ptr<HttpServer::connection> conn); | ||
|
|
||
| const Configuration& config_; | ||
| HealthChecker* health_checker_; |
Will you be modifying it? Why can't it be a const reference (or a const pointer if you accept nullptr as a valid value)?
Done.
src/api_server.cc
Outdated
| conn->set_headers(std::map<std::string, std::string>({ | ||
| {"Content-Type", "text/plain"}, | ||
| })); | ||
| conn->write("unhealthy"); |
Probably worth dumping the list of unhealthy components (and components whose callbacks returned false)...
Done.
src/health_checker.cc
Outdated
| config_.HealthCheckFile()).parent_path()); | ||
| } | ||
|
|
||
| void HealthChecker::RegisterCallback(const std::string& component, |
[Optional] I'd call this AddComponent or RegisterComponent instead... You're adding a pair of <component_name, callback>.
Done.
src/health_checker.cc
Outdated
| return false; | ||
| } | ||
| for (auto& c : component_callbacks_) { | ||
| if (!c.second()) { |
[Optional] std::function objects can have a nullptr value. Should this be if (c.second && !c.second())?
Done.
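For reference, a minimal sketch of the guarded call suggested above. The `CallbackMap` alias and the `AllHealthy` function are hypothetical stand-ins, not code from this PR; the point is only that an empty std::function converts to false and throws std::bad_function_call if invoked.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical stand-in for the component -> callback map in the diff.
using CallbackMap = std::map<std::string, std::function<bool()>>;

// Returns true only if every non-empty callback reports healthy.
// Empty std::function values are skipped instead of being invoked,
// because invoking one throws std::bad_function_call.
bool AllHealthy(const CallbackMap& callbacks) {
  for (const auto& c : callbacks) {
    if (c.second && !c.second()) {
      return false;
    }
  }
  return true;
}
```

Whether an empty callback should count as healthy (as here) or unhealthy is a design choice the thread does not settle.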
src/health_checker.cc
Outdated
| if (!unhealthy_components_.empty()) { | ||
| return false; | ||
| } | ||
| for (auto& c : component_callbacks_) { |
Should you not lock mutex_ while iterating? It's possible to split this into read and write locks (https://en.cppreference.com/w/cpp/thread/shared_mutex).
Ah, are you concerned about a possible deadlock? What's the right way to do this?
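One possible answer, sketched under assumptions: hold the mutex for the whole iteration, and require that callbacks never call back into the checker, since a plain std::mutex is not recursive. `SimpleHealthChecker` and its method names are hypothetical and much simpler than the class in this PR. The read/write split mentioned above would need std::shared_mutex, which is C++17, so this sketch uses a plain std::mutex.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical, simplified checker; not the HealthChecker from this PR.
class SimpleHealthChecker {
 public:
  void Register(const std::string& component, std::function<bool()> cb) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_[component] = std::move(cb);
  }

  void Unregister(const std::string& component) {
    std::lock_guard<std::mutex> lock(mutex_);
    callbacks_.erase(component);
  }

  // Holds mutex_ for the whole loop so Register/Unregister cannot modify
  // the map mid-iteration. Callbacks must not call back into this object,
  // or they would deadlock on the non-recursive mutex.
  bool IsHealthy() const {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& c : callbacks_) {
      if (c.second && !c.second()) {
        return false;
      }
    }
    return true;
  }

 private:
  mutable std::mutex mutex_;  // mutable so const methods can lock it.
  std::map<std::string, std::function<bool()>> callbacks_;
};
```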
src/kubernetes.cc
Outdated
| health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{ | ||
| std::lock_guard<std::mutex> last_received_lock(last_received_mutex); | ||
| return last_received > (std::chrono::high_resolution_clock::now() - | ||
| std::chrono::minutes(5)); |
Should this be a configuration option?
Done.
src/kubernetes.cc
Outdated
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); | ||
| if (health_checker_) { | ||
| health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{ |
Why not capture last_received by const reference?
Looks like capturing by const reference is not available in C++11.
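C++11 has no capture syntax for "by const reference", but a common workaround is to bind a const reference first and capture that by reference. A minimal sketch, where `MakeFreshnessCheck` and `Clock` are hypothetical names and the mutex from the diff is left out for brevity:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

using Clock = std::chrono::steady_clock;

// Builds a staleness check that can read, but not modify, last_received.
// The caller must keep last_received alive for as long as the returned
// callback is in use. Locking is omitted; this sketch is single-threaded.
std::function<bool()> MakeFreshnessCheck(const Clock::time_point& last_received,
                                         std::chrono::seconds max_age) {
  // last_received is already a const reference here, so capturing it by
  // reference gives the lambda read-only access to the caller's variable.
  return [&last_received, max_age] {
    return last_received > Clock::now() - max_age;
  };
}
```

The same effect is available inline by declaring `const auto& ref = last_received;` before the lambda and capturing `&ref`.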
src/api_server.cc
Outdated
| void MetadataApiServer::HandleHealthz( | ||
| const HttpServer::request& request, | ||
| std::shared_ptr<HttpServer::connection> conn) { | ||
| if (health_checker_->IsHealthy()) { |
Should we allow health_checker_ to be nullptr, so we can test different handlers in isolation?
Done.
src/kubernetes.cc
Outdated
| } catch (const boost::system::system_error& e) { | ||
| LOG(ERROR) << "Failed to query " << endpoint << ": " << e.what(); | ||
| if (health_checker_) { | ||
| health_checker_->UnregisterCallback(name); |
Seems like a good candidate for RAII.
Done.
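A rough sketch of what the RAII approach might look like: registration happens in the constructor and unregistration in the destructor, so every exit path (including exceptions) cleans up. `Registry` and `ScopedRegistration` are hypothetical names used only for illustration; the actual class in this PR may differ.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical minimal registry standing in for the health checker.
class Registry {
 public:
  void Register(const std::string& name, std::function<bool()> cb) {
    callbacks_[name] = std::move(cb);
  }
  void Unregister(const std::string& name) { callbacks_.erase(name); }
  bool Contains(const std::string& name) const {
    return callbacks_.count(name) > 0;
  }

 private:
  std::map<std::string, std::function<bool()>> callbacks_;
};

// Registers on construction and unregisters on destruction.
// A null registry is accepted and turns both steps into no-ops.
class ScopedRegistration {
 public:
  ScopedRegistration(Registry* registry, const std::string& name,
                     std::function<bool()> cb)
      : registry_(registry), name_(name) {
    if (registry_ != nullptr) registry_->Register(name_, std::move(cb));
  }
  ~ScopedRegistration() {
    if (registry_ != nullptr) registry_->Unregister(name_);
  }
  // Non-copyable, so the component is unregistered exactly once.
  ScopedRegistration(const ScopedRegistration&) = delete;
  ScopedRegistration& operator=(const ScopedRegistration&) = delete;

 private:
  Registry* registry_;
  const std::string name_;  // Stored by value, so no dangling reference.
};
```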
davidbtucker
left a comment
Thanks, PTAL.
igorpeshansky
left a comment
Some more stuff.
src/health_checker.cc
Outdated
| result.insert(c.first); | ||
| } | ||
| } | ||
| return std::move(result); |
This is a set of strings; there are no movable components here — you don't need std::move.
Done.
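As background (not necessarily the reviewer's own rationale): returning a local variable by name already allows the compiler to elide the copy or, failing that, to move implicitly, while wrapping it in std::move prevents copy elision. A small sketch with a hypothetical `Unhealthy` helper:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical helper: collects the names whose status flag is false.
std::set<std::string> Unhealthy(const std::map<std::string, bool>& status) {
  std::set<std::string> result;
  for (const auto& s : status) {
    if (!s.second) {
      result.insert(s.first);
    }
  }
  // Plain return of a local: eligible for copy elision, otherwise moved.
  return result;
}
```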
src/health_checker.h
Outdated
|
|
||
| // Registers a component and then unregisters when it goes out of | ||
| // scope. | ||
| class ScopedHealthCheckRegistration { |
Maybe just CheckHealth?
Done.
src/health_checker.h
Outdated
| } | ||
| private: | ||
| HealthChecker* health_checker_; | ||
| const std::string& component_; |
Uh, don't do this. If you store a reference to a temporary, you'll cause a crash. Just store a copy of the string.
Oops, thanks. Fixed.
src/kubernetes.cc
Outdated
| } | ||
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); | ||
| auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()); |
time::seconds.
Done.
src/kubernetes.cc
Outdated
| } | ||
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); | ||
| auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()); |
Seems like you can have a single variable called expiration that is set to std::chrono::high_resolution_clock::now() + std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()) here and in the watcher callback, i.e.:

    std::mutex expiration_mutex;
    auto expiration = std::chrono::high_resolution_clock::now() +
                      time::seconds(config_.HealthCheckWatchTimeoutSeconds());
    ...
    [&expiration_mutex, &expiration]{
      std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
      return std::chrono::high_resolution_clock::now() < expiration;
    }
    ...
    [=, &expiration_mutex, &expiration](json::value raw_watch) {
      {
        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
        expiration = std::chrono::high_resolution_clock::now() +
                     time::seconds(config_.HealthCheckWatchTimeoutSeconds());
      }
      WatchEventCallback(callback, name, std::move(raw_watch));
    }

Might be a good idea to factor out config_.HealthCheckWatchTimeoutSeconds() into a local variable called timeout, something like:

    const int timeout = config_.HealthCheckWatchTimeoutSeconds();
    auto expiration = std::chrono::high_resolution_clock::now() + time::seconds(timeout);
    ...
    [&expiration_mutex, &expiration]{
      std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
      return std::chrono::high_resolution_clock::now() < expiration;
    }
    ...
    [=, &expiration_mutex, &expiration](json::value raw_watch) {
      {
        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
        expiration = std::chrono::high_resolution_clock::now() + time::seconds(timeout);
      }
      WatchEventCallback(callback, name, std::move(raw_watch));
    }
Done, thanks.
src/kubernetes.cc
Outdated
| LOG(INFO) << "WatchMaster(" << name << "): Contacting " << endpoint; | ||
| } | ||
| std::mutex last_received_mutex; | ||
| auto last_received = std::chrono::high_resolution_clock::now(); |
Your timeout is measured in integral seconds. You probably don't need high_resolution_clock, especially because it may be pointing to system_clock, which can be confused by DST changes and such. Why not use std::chrono::steady_clock instead?
Done.
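To illustrate the steady_clock suggestion, here is a small self-contained sketch of a deadline tracker. The `Expiration` class and its method names are hypothetical and only approximate the expiration logic discussed in this review; std::chrono::steady_clock is monotonic, so the deadline is not affected by wall-clock adjustments.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>

// Hypothetical helper: tracks the time by which new data must arrive
// for a watcher to still be considered healthy.
class Expiration {
 public:
  explicit Expiration(std::chrono::seconds timeout)
      : timeout_(timeout),
        deadline_(std::chrono::steady_clock::now() + timeout) {}

  // Call whenever data is received; pushes the deadline forward.
  void Touch() {
    std::lock_guard<std::mutex> lock(mutex_);
    deadline_ = std::chrono::steady_clock::now() + timeout_;
  }

  // True while the deadline has not yet passed.
  bool Fresh() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return std::chrono::steady_clock::now() < deadline_;
  }

 private:
  const std::chrono::seconds timeout_;
  mutable std::mutex mutex_;  // Guards deadline_.
  std::chrono::steady_clock::time_point deadline_;
};
```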
|
|
||
| std::set<std::string> HealthChecker::UnhealthyComponents() const { | ||
| std::lock_guard<std::mutex> lock(mutex_); | ||
| std::set<std::string> result(unhealthy_components_); |
I thought we were dropping these?..
Done. I forgot what we said -- still use the unhealthy_components_ for IsHealthy()? (The current unit test wants that.)
Whoops, they are still needed.
src/configuration.cc
Outdated
| constexpr const char kDefaultInstanceZone[] = ""; | ||
| constexpr const char kDefaultHealthCheckFile[] = | ||
| "/var/run/metadata-agent/health/unhealthy"; | ||
| constexpr const int kDefaultHealthCheckWatchTimeoutSeconds = 5*60; |
Nit: WatchTimeout is too generic, can we change it to KubernetesWatch throughout?
KubernetesWatchTimeout*
Renamed to HealthCheckMaxDataAgeSeconds.
igorpeshansky
left a comment
This is no longer RFC — let's rename the PR.
Some remaining minor comments.
src/kubernetes.cc
Outdated
| [=, &expiration_mutex, &expiration](json::value raw_watch) { | ||
| { | ||
| std::lock_guard<std::mutex> expiration_lock(expiration_mutex); | ||
| expiration = std::chrono::steady_clock::now() + |
[Optional] Might read better as:

    expiration =
        std::chrono::steady_clock::now() + time::seconds(timeout);
Done.
src/kubernetes.cc
Outdated
| } | ||
| const int timeout = config_.HealthCheckWatchTimeoutSeconds(); | ||
| std::mutex expiration_mutex; | ||
| auto expiration = std::chrono::steady_clock::now() + time::seconds(timeout); |
I would add a comment here explaining what this is, something like "The time by when the watcher has to receive some data to be considered healthy"...
Done.
src/kubernetes.cc
Outdated
| CheckHealth check_health( | ||
| health_checker_, name, [&expiration_mutex, &expiration]{ | ||
| std::lock_guard<std::mutex> expiration_lock(expiration_mutex); | ||
| return std::chrono::high_resolution_clock::now() < expiration; |
std::chrono::steady_clock, right?
Oops, fixed.
src/api_server.cc
Outdated
| } | ||
| if (unhealthy_components.empty()) { | ||
| if (config_.VerboseLogging()) { | ||
| LOG(INFO) << "Healthz returning 200"; |
Let's s/Healthz//healthz/ here and in the next log statement.
Done.
src/api_server.cc
Outdated
| {"Content-Type", "text/plain"}, | ||
| })); | ||
| conn->write("unhealthy components:\n"); | ||
| for (const auto& s : unhealthy_components) { |
[Optional] This is a component, so wouldn't c (or even component) work better?
Done.
igorpeshansky
left a comment
LGTM
igorpeshansky
left a comment
LGTM
igorpeshansky
left a comment
LGTM
bmoyles0117
left a comment
LGTM!
No description provided.