Skip to content
This repository was archived by the owner on Aug 19, 2019. It is now read-only.

Conversation

@davidbtucker
Copy link
Contributor

No description provided.

return false;
}
for (auto& c : component_callbacks_) {
if (!c.second()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this works, forgive my naiveness in C++, wouldn't UnregisterCallback remove the key entirely, meaning that the c for the component would not get iterated over in the first place?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't check for the existence of the value, this calls the value (which is a std::function) and checks the result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmoyles0117 Unregistering should only happen once the thread is done.
@igorpeshansky That's right.

Copy link
Contributor

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few preliminary comments.

src/api_server.h Outdated
std::shared_ptr<HttpServer::connection> conn);

const Configuration& config_;
HealthChecker* health_checker_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you be modifying it? Why can't it be a const reference (or a const pointer if you accept nullptr as a valid value)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

conn->set_headers(std::map<std::string, std::string>({
{"Content-Type", "text/plain"},
}));
conn->write("unhealthy");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth dumping the list of unhealthy components (and components whose callbacks returned false)...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

config_.HealthCheckFile()).parent_path());
}

void HealthChecker::RegisterCallback(const std::string& component,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional] I'd call this AddComponent or RegisterComponent instead... You're adding a pair of <component_name, callback>.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return false;
}
for (auto& c : component_callbacks_) {
if (!c.second()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional] std::function objects can have a nullptr value. Should this be if (c.second && !c.second())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (!unhealthy_components_.empty()) {
return false;
}
for (auto& c : component_callbacks_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you not lock mutex_ while iterating? It's possible to split this into read and write locks (https://en.cppreference.com/w/cpp/thread/shared_mutex).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, are you concerned about a possible deadlock? What's the right way to do this?

health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{
std::lock_guard<std::mutex> last_received_lock(last_received_mutex);
return last_received > (std::chrono::high_resolution_clock::now() -
std::chrono::minutes(5));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a configuration option?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

std::mutex last_received_mutex;
auto last_received = std::chrono::high_resolution_clock::now();
if (health_checker_) {
health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not capture last_received by const reference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like capturing by const reference is not available in C++11.

void MetadataApiServer::HandleHealthz(
const HttpServer::request& request,
std::shared_ptr<HttpServer::connection> conn) {
if (health_checker_->IsHealthy()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we allow health_checker_ to be nullptr, so we can test different handlers in isolation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

} catch (const boost::system::system_error& e) {
LOG(ERROR) << "Failed to query " << endpoint << ": " << e.what();
if (health_checker_) {
health_checker_->UnregisterCallback(name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a good candidate for RAII.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

@davidbtucker davidbtucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, PTAL.

void MetadataApiServer::HandleHealthz(
const HttpServer::request& request,
std::shared_ptr<HttpServer::connection> conn) {
if (health_checker_->IsHealthy()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

conn->set_headers(std::map<std::string, std::string>({
{"Content-Type", "text/plain"},
}));
conn->write("unhealthy");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

src/api_server.h Outdated
std::shared_ptr<HttpServer::connection> conn);

const Configuration& config_;
HealthChecker* health_checker_;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return false;
}
for (auto& c : component_callbacks_) {
if (!c.second()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

config_.HealthCheckFile()).parent_path());
}

void HealthChecker::RegisterCallback(const std::string& component,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{
std::lock_guard<std::mutex> last_received_lock(last_received_mutex);
return last_received > (std::chrono::high_resolution_clock::now() -
std::chrono::minutes(5));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

std::mutex last_received_mutex;
auto last_received = std::chrono::high_resolution_clock::now();
if (health_checker_) {
health_checker_->RegisterCallback(name, [&last_received_mutex, &last_received]{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like capturing by const reference is not available in C++11.

return false;
}
for (auto& c : component_callbacks_) {
if (!c.second()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bmoyles0117 Unregistering should only happen once the thread is done.
@igorpeshansky That's right.

} catch (const boost::system::system_error& e) {
LOG(ERROR) << "Failed to query " << endpoint << ": " << e.what();
if (health_checker_) {
health_checker_->UnregisterCallback(name);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (!unhealthy_components_.empty()) {
return false;
}
for (auto& c : component_callbacks_) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, are you concerned about a possible deadlock? What's the right way to do this?

Copy link
Contributor

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more stuff.

result.insert(c.first);
}
}
return std::move(result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a set of strings; there are no movable components here — you don't need std::move.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


// Registers a component and then unregisters when it goes out of
// scope.
class ScopedHealthCheckRegistration {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just CheckHealth?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
private:
HealthChecker* health_checker_;
const std::string& component_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh, don't do this. If you store a reference to a temporary, you'll cause a crash. Just store a copy of the string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, thanks. Fixed.

}
std::mutex last_received_mutex;
auto last_received = std::chrono::high_resolution_clock::now();
auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time::seconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
std::mutex last_received_mutex;
auto last_received = std::chrono::high_resolution_clock::now();
auto timeout = std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like you can have a single variable called expiration that is set to std::chrono::high_resolution_clock::now() + std::chrono::seconds(config_.HealthCheckWatchTimeoutSeconds()) here and in the watcher callback, i.e.:

  std::mutex expiration_mutex;
  auto expiration = std::chrono::high_resolution_clock::now() +
                    time::seconds(config_.HealthCheckWatchTimeoutSeconds());
  ...
      [&expiration_mutex, &expiration]{
        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
        return std::chrono::high_resolution_clock::now() < expiration;
      }
  ...
      [=, &expiration_mutex, &expiration](json::value raw_watch) {
        {
          std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
          expiration = std::chrono::high_resolution_clock::now() +
                       time::seconds(config_.HealthCheckWatchTimeoutSeconds());
        }
        WatchEventCallback(callback, name, std::move(raw_watch));
      }

Might be a good idea to factor out config_.HealthCheckWatchTimeoutSeconds() into a local variable called timeout, something like:

  const int timeout = config_.HealthCheckWatchTimeoutSeconds();
  auto expiration = std::chrono::high_resolution_clock::now() + time::seconds(timeout);
  ...
      [&expiration_mutex, &expiration]{
        std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
        return std::chrono::high_resolution_clock::now() < expiration;
      }
  ...
      [=, &expiration_mutex, &expiration](json::value raw_watch) {
        {
          std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
          expiration = std::chrono::high_resolution_clock::now() + time::seconds(timeout);
        }
        WatchEventCallback(callback, name, std::move(raw_watch));
      }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks.

LOG(INFO) << "WatchMaster(" << name << "): Contacting " << endpoint;
}
std::mutex last_received_mutex;
auto last_received = std::chrono::high_resolution_clock::now();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your timeout is measured in integral seconds. You probably don't need high_resolution_clock, especially because it may be pointing to system_clock, which can be confused by DST changes and such. Why not use std::steady_clock instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


std::set<std::string> HealthChecker::UnhealthyComponents() const {
std::lock_guard<std::mutex> lock(mutex_);
std::set<std::string> result(unhealthy_components_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were dropping these?..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I forgot what we said -- still use the unhealthy_components_ for IsHealthy()? (The current unit test wants that.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, they are still needed.

constexpr const char kDefaultInstanceZone[] = "";
constexpr const char kDefaultHealthCheckFile[] =
"/var/run/metadata-agent/health/unhealthy";
constexpr const int kDefaultHealthCheckWatchTimeoutSeconds = 5*60;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: WatchTimeout is too generic, can we change it to KubernetesWatch throughout?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KubernetesWatchTimeout*

Copy link
Contributor Author

@davidbtucker davidbtucker Jul 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to HealthCheckMaxDataAgeSeconds.

Copy link
Contributor

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer RFC — let's rename the PR.
Some remaining minor comments.

[=, &expiration_mutex, &expiration](json::value raw_watch) {
{
std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
expiration = std::chrono::steady_clock::now() +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional] Might read better as:

          expiration =
            std::chrono::steady_clock::now() + time::seconds(timeout);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}
const int timeout = config_.HealthCheckWatchTimeoutSeconds();
std::mutex expiration_mutex;
auto expiration = std::chrono::steady_clock::now() + time::seconds(timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a comment here explaining what this is, something like "The time by when the watcher has to receive some data to be considered healthy"...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

CheckHealth check_health(
health_checker_, name, [&expiration_mutex, &expiration]{
std::lock_guard<std::mutex> expiration_lock(expiration_mutex);
return std::chrono::high_resolution_clock::now() < expiration;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::chrono::steady_clock, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, fixed.

}
if (unhealthy_components.empty()) {
if (config_.VerboseLogging()) {
LOG(INFO) << "Healthz returning 200";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's s/Healthz//healthz/ here and in the next log statement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

{"Content-Type", "text/plain"},
}));
conn->write("unhealthy components:\n");
for (const auto& s : unhealthy_components) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Optional] This is a component, so wouldn't c (or even component) work better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@davidbtucker davidbtucker changed the title RFC: Add /healthz endpoint that returns 500 when some watch stream is stale. Add /healthz endpoint that returns 500 when some watch stream is stale. Jul 20, 2018
Copy link
Contributor

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

Copy link
Contributor

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:


std::set<std::string> HealthChecker::UnhealthyComponents() const {
std::lock_guard<std::mutex> lock(mutex_);
std::set<std::string> result(unhealthy_components_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, they are still needed.

Copy link
Contributor

@igorpeshansky igorpeshansky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :shipit:

Copy link
Contributor

@bmoyles0117 bmoyles0117 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@igorpeshansky igorpeshansky merged commit 5c7dde6 into master Jul 22, 2018
@igorpeshansky igorpeshansky deleted the davidbtucker-watch-healthz branch July 22, 2018 15:14
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants