This repository was archived by the owner on Aug 19, 2019. It is now read-only.

Conversation

@igorpeshansky (Contributor, Author)

This is a preliminary PR to check whether this fixes the problem.

@bmoyles0117 (Contributor) left a comment

I'm not sure that this is actually solving the problem at hand. Why are we appending the number of bytes remaining to the end of the stream?

@igorpeshansky (Contributor, Author)

This is now handled differently. PTAL.

@bmoyles0117 (Contributor) left a comment

I'm torn on this. Now that we're unlocking the completion mutex, it emphasizes that these threads can die. And without reconnect logic in the metadata agent, or at least a health check endpoint that tells us it's not running, the metadata agent will go on living forever, its threads dead, but never exiting as a process. WDYT?

@igorpeshansky (Contributor, Author)

I've verified that this at least causes the watch thread to exit gracefully instead of going into an infinite loop. PTAL.

@igorpeshansky (Contributor, Author)

I'd rather handle these issues separately. We do need thread restart logic, and we do need a healthcheck endpoint. But we should also prevent the infinite loop. WDYT?

@bmoyles0117 (Contributor)

I'm ok with that, potentially as a simple follow-up PR. I would suggest we make these errors INCREDIBLY obvious, as a sign to the user that "something" is likely broken, so that when they look at the logs it's not something as simple as "Thread Exited"; it should have a huge warning around it that the agent should likely be restarted.

@igorpeshansky (Contributor, Author)

Ok, let's revisit after we get the basics fixed. I have a couple of follow-up PRs depending on this one, so PTAL.

@qingling128 left a comment

Let's fix the infinite loop first since this is quite urgent. The rest could be handled in follow-up PRs.

[Optional] Do we want to rephrase the error (Invalid chunked encoding) in a way that implies the impact?

@igorpeshansky (Contributor, Author)

@qingling128 Do you have any concrete proposals for the phrasing? If so, could you please add them in a comment on the code?

@igorpeshansky force-pushed the igorp-watch-chunked-fix branch from eb3a130 to b674482 on February 26, 2018 20:06
@igorpeshansky (Contributor, Author)

Rebased off master.

```diff
       << std::string(begin, end)
-      << "'";
+      << "'; exiting";
   return boost::iterator_range<const char*>(begin, end);
```

@qingling128

Something like "'; Skipping the current chunk and exiting the thread. Metadata of certain monitored resources might not be up to date.";?

Not very sure if the impact is accurate.

@igorpeshansky (Contributor, Author)

Take a look at the new message. WDYT?

@igorpeshansky (Contributor, Author) left a comment

The proper fix turned out to be trickier than I thought, but I think I got it now. PTAL.


```cpp
    LOG(ERROR) << "No more pod metadata will be collected";
  } catch (const KubernetesReader::QueryException& e) {
    // Already logged.
    LOG(ERROR) << "No more pod metadata will be collected";
```

@qingling128

Should we distinguish the logs for these two errors, for debugging purposes? Same below.

@igorpeshansky (Contributor, Author)

They would each be preceded by another log message that details the error. That should be enough to distinguish them.

@qingling128

Ah, I see. Then we should be fine.

@bmoyles0117 (Contributor)

We're not logging the e.what() here as we do above. Is that because it's logged elsewhere? Can we juggle things around for consistency?

@igorpeshansky (Contributor, Author)

Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.
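
For illustration, the convention being described here as a minimal standalone sketch (not project code; std::cerr stands in for the project's LOG macro, and all names are assumed): the cause is logged once at the throw site, so catch sites only log the consequence.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Sketch only: log-at-throw, catch-quietly.
struct QueryException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

std::string QueryEndpoint(const std::string& path) {
  const bool request_failed = true;  // Pretend the query failed.
  if (request_failed) {
    // The cause is logged exactly once, where the exception is thrown...
    std::cerr << "ERROR: Failed to query " << path << "\n";
    throw QueryException("Failed to query " + path);
  }
  return "response body";
}

int main() {
  try {
    QueryEndpoint("/api/v1/pods");
  } catch (const QueryException&) {
    // ...so the catch site only logs the consequence.
    std::cerr << "ERROR: No more pod metadata will be collected\n";
  }
}
```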

```diff
 namespace {
 struct Watcher {
-  Watcher(std::function<void(json::value)> event_callback,
+  Watcher(const std::string& endpoint,
```

@qingling128

Good idea adding a watcher name to make the error log more informative.
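
The pattern under discussion, as a minimal hedged sketch: the constructor shape is an assumption based on the diff above, and std::cerr again stands in for LOG, but the "name => message" form follows the log statements quoted elsewhere in this review.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <utility>

// Sketch only: a watcher that remembers which endpoint it watches, so
// every error line can say which watch failed.
class Watcher {
 public:
  Watcher(const std::string& endpoint,
          std::function<void(const std::string&)> event_callback)
      : name_(endpoint), event_callback_(std::move(event_callback)) {}

  void LogError(const std::string& message) const {
    // e.g. "/api/v1/watch/pods => Invalid chunked encoding".
    std::cerr << "ERROR: " << name_ << " => " << message << "\n";
  }

 private:
  std::string name_;  // The watched endpoint, used to prefix error logs.
  std::function<void(const std::string&)> event_callback_;
};
```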

```cpp
  }

  try {
//#ifdef VERBOSE
```

@qingling128

There seem to be many commented-out lines like this. Should we clean them up?

@igorpeshansky (Contributor, Author)

Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.

@qingling128

Sounds good.

```diff
-    return range;
+    LOG(ERROR) << name_ << " => "
+               << "Asked to read next chunk with no bytes remaining";
+    return range;  // TODO: should this throw an exception instead?
```

@qingling128

Sounds like an exception to me.

@igorpeshansky (Contributor, Author)

Done.
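
The resulting behavior, as a minimal sketch. The PR's own WatcherException stores a string explanation_ (see the snippet later in this thread); here it is modeled on std::runtime_error for brevity, and the class and member names are assumptions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Sketch only: instead of logging and returning an empty range (the old
// TODO), asking for the next chunk with no bytes remaining now throws.
struct WatcherException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

class ChunkReader {
 public:
  explicit ChunkReader(const std::string& name) : name_(name) {}

  void EnsureBytesRemaining() const {
    if (bytes_remaining_ == 0) {
      throw WatcherException(
          name_ + " => Asked to read next chunk with no bytes remaining");
    }
  }

 private:
  std::string name_;                 // Watched endpoint, for error messages.
  std::size_t bytes_remaining_ = 0;  // Bytes left in the current chunk.
};
```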

@igorpeshansky (Contributor, Author) left a comment

Thanks, PTAL.




@qingling128 left a comment

LGTM.



```cpp
  void operator()(const boost::iterator_range<const char*>& range,
                  const boost::system::error_code& error) {
    if (!error) {
      if (!exception_.empty()) {
```
@bmoyles0117 (Contributor)

Being that exception_ is a string, can we change the name?

@igorpeshansky (Contributor, Author)

I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?

@bmoyles0117 (Contributor)

Something like exception_message_ or exception_string_, just to clarify that we can't treat it like an exception (despite it clearly being defined as a string).

@igorpeshansky (Contributor, Author)

Renamed to exception_message_.

```cpp
  }
  std::lock_guard<std::mutex> await_completion(completion_mutex);
  if (!watcher.exception().empty()) {
    throw QueryException(watcher.exception());
```
@bmoyles0117 (Contributor)

Do we really need to throw an exception here? Either way, the thread will end up exiting. Does this exception bubble up to kill the main process?

@igorpeshansky (Contributor, Author)

This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.
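
To make the threading explicit, a standalone sketch of the hand-off. The PR itself synchronizes on a completion mutex; this sketch models the same flow with std::promise for brevity, and the names are simplified. The watcher thread only records a failure message (exceptions cannot cross thread boundaries); the thread waiting inside WatchMaster rethrows it, so it is WatchMaster's callers who catch the QueryException.

```cpp
#include <future>
#include <stdexcept>
#include <string>
#include <thread>

struct QueryException : std::runtime_error {
  using std::runtime_error::runtime_error;
};

void WatchMaster() {
  std::promise<std::string> completion;  // Empty string means success.

  std::thread watcher_thread([&completion] {
    // ... run the watch; on failure, store a message instead of throwing.
    completion.set_value("Invalid chunked encoding");  // Simulated failure.
  });

  std::string exception_message = completion.get_future().get();  // Wait.
  watcher_thread.join();

  if (!exception_message.empty()) {
    // Thrown on the waiting thread, where callers can actually catch it,
    // not on the (already finished) watcher thread.
    throw QueryException(exception_message);
  }
}

int main() {
  try {
    WatchMaster();
  } catch (const QueryException&) {
    // A caller of WatchMaster sees the watcher's failure here.
  }
}
```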


```diff
@@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback)
                    std::placeholders::_2, std::placeholders::_3));
   } catch (const json::Exception& e) {
```
@bmoyles0117 (Contributor)

Do we need to handle the WatcherException here?

@igorpeshansky (Contributor, Author)

No -- it will never be thrown in this thread, and exceptions don't propagate across threads.
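
That point can be demonstrated standalone (not project code): an exception thrown on another std::thread never reaches a try/catch on the spawning thread. It has to be captured explicitly, e.g. via std::exception_ptr, or recorded as a message the way this PR does.

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>
#include <thread>

int main() {
  std::exception_ptr captured = nullptr;

  std::thread worker([&captured] {
    try {
      throw std::runtime_error("thrown on the worker thread");
    } catch (...) {
      // Without this catch, the exception would call std::terminate()
      // on the worker thread -- it would never reach the try/catch in
      // main(). Capturing it is the only way to hand it across.
      captured = std::current_exception();
    }
  });
  worker.join();

  try {
    if (captured) std::rethrow_exception(captured);
  } catch (const std::runtime_error& e) {
    std::cout << "Rethrown on the main thread: " << e.what() << "\n";
  }
}
```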

```cpp
  WatcherException(const std::string& what) : explanation_(what) {}
  const std::string& what() const { return explanation_; }
 private:
  std::string explanation_;
```
@bmoyles0117 (Contributor)

Why is this called explanation_ instead of what_?

@igorpeshansky (Contributor, Author)

For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.

@bmoyles0117 (Contributor)

I'm fine with e.what(); I just got confused about why the variable is called explanation_. Not critical in either case, so happy to just approve.

@igorpeshansky (Contributor, Author) left a comment

Addressed your feedback. PTAL.






@bmoyles0117 (Contributor) left a comment

LGTM

@igorpeshansky (Contributor, Author) left a comment

Thanks for the reviews. Merging.


@igorpeshansky igorpeshansky merged commit 2bbad32 into master Feb 27, 2018
@igorpeshansky igorpeshansky deleted the igorp-watch-chunked-fix branch February 27, 2018 21:42