-
Notifications
You must be signed in to change notification settings - Fork 11
Be more robust in handling broken chunked encoding. #58
Conversation
bmoyles0117
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this is actually solving the problem at hand. Why are we appending the number of bytes remaining to the end of the stream?
|
This is now handled differently. PTAL. |
bmoyles0117
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm torn on this, being that we're unlocking the completion mutex now, it emphasizes that these threads can die, and not having reconnect logic metadata agent, or at least a health check endpoint that tells us we're not running, means that the metadata agent will go on living forever, having its threads die, but not exiting as a process. WDYT?
|
I've verified that this at least causes the watch thread to exit gracefully instead of going into an infinite loop. PTAL. |
|
I'd rather handle these issues separately. We do need thread restart logic, and we do need a healthcheck endpoint. But we should also prevent the infinite loop. WDYT? |
|
I'm ok with that, potentially as a simple follow up PR, I would suggest we make these errors INCREDIBLY obvious, as a sign to the user that "something" is likely broken, so when they look at the logs it's not something as simple as "Thread Exited", it should have a huge warning around it that the agent should likely be restarted. |
|
Ok, let's revisit after we get the basics fixed. I have a couple of follow-up PRs depending on this one, so PTAL. |
qingling128
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's fix the infinite loop first since this is quite urgent. The rest could be handled in follow-up PRs.
[Optional] Do we want to rephrase the error (Invalid chunked encoding) in a way that it implies the impact?
|
@qingling128 Do you have any concrete proposals for the phrasing? If so, could you please add them in a comment on the code? |
eb3a130 to
b674482
Compare
|
Rebased off |
src/kubernetes.cc
Outdated
| << std::string(begin, end) | ||
| << "'"; | ||
| return boost::iterator_range<const char*>(begin, end); | ||
| << "'; exiting"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like "'; Skipping the current chunk and exiting the thread. Metadata of certain monitored resource might not be up to date.";
?
Not very sure if the impact is accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the new message. WDYT?
igorpeshansky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proper fix turned out to be trickier than I thought, but I think I got it now. PTAL.
src/kubernetes.cc
Outdated
| << std::string(begin, end) | ||
| << "'"; | ||
| return boost::iterator_range<const char*>(begin, end); | ||
| << "'; exiting"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Take a look at the new message. WDYT?
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Distinguish the log of these two errors for debugging purpose? Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They would each be preceded by another log message that details the error. That should be enough to distinguish them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Then we should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.
| namespace { | ||
| struct Watcher { | ||
| Watcher(std::function<void(json::value)> event_callback, | ||
| Watcher(const std::string& endpoint, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea adding a watching name to make the error log more informative.
| } | ||
|
|
||
| try { | ||
| //#ifdef VERBOSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seem to be many commented-out lines like this. Should we clean them up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
src/kubernetes.cc
Outdated
| return range; | ||
| LOG(ERROR) << name_ << " => " | ||
| << "Asked to read next chunk with no bytes remaining"; | ||
| return range; // TODO: should this throw an exception instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds like an exception to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
igorpeshansky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, PTAL.
| } | ||
|
|
||
| try { | ||
| //#ifdef VERBOSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in this PR. I have a bug open to factor out the chunked encoding handler — will clean this up when I do that.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They would each be preceded by another log message that details the error. That should be enough to distinguish them.
src/kubernetes.cc
Outdated
| return range; | ||
| LOG(ERROR) << name_ << " => " | ||
| << "Asked to read next chunk with no bytes remaining"; | ||
| return range; // TODO: should this throw an exception instead? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
qingling128
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| } | ||
|
|
||
| try { | ||
| //#ifdef VERBOSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. Then we should be fine.
src/kubernetes.cc
Outdated
| void operator()(const boost::iterator_range<const char*>& range, | ||
| const boost::system::error_code& error) { | ||
| if (!error) { | ||
| if (!exception_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being that exception is a string, can we change the name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like exception_message_ or exception_string_ just to add clarify that we can't treat it like an exception (despite it clearly being defined as a string)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to exception_message_.
| } | ||
| std::lock_guard<std::mutex> await_completion(completion_mutex); | ||
| if (!watcher.exception().empty()) { | ||
| throw QueryException(watcher.exception()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to throw an exception here? Either way, the thread will end up exiting. Does this exception bubble up to kill the main process?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not logging the e.what() here as we do above, is that because it's logged elsewhere? Can we juggle things around for consistency?
| @@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback) | |||
| std::placeholders::_2, std::placeholders::_3)); | |||
| } catch (const json::Exception& e) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to handle the WatcherException here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No -- it will never be thrown in this thread, and exceptions don't propagate across threads.
| WatcherException(const std::string& what) : explanation_(what) {} | ||
| const std::string& what() const { return explanation_; } | ||
| private: | ||
| std::string explanation_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this called explanation_ instead of what_?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with e.what(), I just got confused why the local variable is called explanation_, not critical in either case so happy to just approve.
igorpeshansky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed your feedback. PTAL.
| WatcherException(const std::string& what) : explanation_(what) {} | ||
| const std::string& what() const { return explanation_; } | ||
| private: | ||
| std::string explanation_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with all other exceptions. :-) If we choose to change it, we should do it in one fell swoop, but we could also keep it this way.
e.what() is fairly standard in Boost and others.
src/kubernetes.cc
Outdated
| void operator()(const boost::iterator_range<const char*>& range, | ||
| const boost::system::error_code& error) { | ||
| if (!error) { | ||
| if (!exception_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially had it as error_, but that ended up being too easy to confuse with the error argument. Any naming suggestions?
| } | ||
| std::lock_guard<std::mutex> await_completion(completion_mutex); | ||
| if (!watcher.exception().empty()) { | ||
| throw QueryException(watcher.exception()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the same thread. This exception is thrown from the thread that is waiting for the watcher thread. It will be caught by callers of WatchMaster.
| LOG(ERROR) << "No more pod metadata will be collected"; | ||
| } catch (const KubernetesReader::QueryException& e) { | ||
| // Already logged. | ||
| LOG(ERROR) << "No more pod metadata will be collected"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's because it's logged elsewhere. The code is fairly consistent about logging any time a QueryException is thrown, so we don't need to log when catching it.
I've just added one missing log statement above.
| @@ -1028,8 +1084,9 @@ void KubernetesReader::WatchNode(MetadataUpdater::UpdateCallback callback) | |||
| std::placeholders::_2, std::placeholders::_3)); | |||
| } catch (const json::Exception& e) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No -- it will never be thrown in this thread, and exceptions don't propagate across threads.
bmoyles0117
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
igorpeshansky
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews. Merging.
src/kubernetes.cc
Outdated
| void operator()(const boost::iterator_range<const char*>& range, | ||
| const boost::system::error_code& error) { | ||
| if (!error) { | ||
| if (!exception_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to exception_message_.
This is a preliminary PR to check whether this fixed the problem.