
Fix M3Reporter.Processor Flaky Tests #71

Merged

prateek merged 31 commits into uber-java:master from alexeykudinkin:mp-ff-fix-proc on Jul 27, 2020

Conversation

@alexeykudinkin (Contributor)

Fixing M3Reporter.Processor, preparing it for a multi-processor setup.

Previously, even though M3Reporter had been set up to run 10 Processors, it never actually worked properly, and only by pure luck did it avoid corrupting its own state (those very errors somewhat saved it from doing so):

  1. The flushing sequence didn't work: a flush was only triggered within a single processor.
  2. Processors shared the same transport without ANY synchronization, which could result in a corrupted payload being sent over the wire (a sketch of one way to serialize such access follows below).

This change addresses all of those issues.

However, additional tests need to be written alongside fixing the existing ones.
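
For illustration, a minimal sketch of one way to address the second issue, with assumed class and method names rather than the actual tally M3 types: serialize writes to the shared transport through a lock owned by the reporter.

    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative only: guards a transport shared by several processor threads so
    // that payloads from different processors can never interleave on the wire.
    final class SharedTransportWriter {

        // Assumed minimal transport abstraction for the sake of the example.
        interface Transport {
            void write(byte[] payload);
        }

        private final ReentrantLock transportLock = new ReentrantLock();
        private final Transport transport;

        SharedTransportWriter(Transport transport) {
            this.transport = transport;
        }

        // Called concurrently by multiple processor threads; the lock ensures each
        // payload is written as a whole before another processor can write.
        void flush(byte[] payload) {
            transportLock.lock();
            try {
                transport.write(payload);
            } finally {
                transportLock.unlock();
            }
        }
    }

An alternative design is to give every processor its own transport instance, so that no cross-thread synchronization is needed at all.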

@CLAassistant commented May 10, 2020

CLA assistant check
All committers have signed the CLA.

@prateek (Collaborator) left a comment

Mainly looks fine. Left a few comments inline.

One bigger question: I see a lot of improved hygiene and simpler code, but I don't follow how this fixes anything that is breaking on master. What did you observe going wrong, and how does this fix it?

    public List<Metric> getMetrics() {
        lock.readLock().lock();
        try {
            return metrics;
@prateek (Collaborator)

Instead of a reference to the guarded variable, should you return a copy here?

@alexeykudinkin (Contributor, Author) commented Jul 20, 2020

This is just to silence the linter -- this is test code, and the method is only meant to be called at the end of the test.

@prateek (Collaborator)
OK. Please add a WARNING in the method doc letting people know it's broken if used in any other way.

@alexeykudinkin (Contributor, Author)

It's not broken; this is a test method within the test package -- I can hardly imagine it being used in any other way.

@prateek (Collaborator)
Simple failure mode: if users access the data during a test while metrics are still being emitted/changed, it's racy.

@alexeykudinkin (Contributor, Author)
Fair enough, will update
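
A minimal sketch of the defensive-copy variant discussed in this thread (the class, field, and lock names mirror the quoted snippet but are assumptions, not the merged code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative test-support reporter that records metrics and hands out snapshots.
    final class CapturingStatsReporter {

        // Placeholder for whatever metric type the test harness records.
        interface Metric {}

        private final ReadWriteLock lock = new ReentrantReadWriteLock();
        private final List<Metric> metrics = new ArrayList<>();

        // Returns a snapshot instead of the guarded list itself, so a test that reads
        // metrics while processor threads are still emitting never races on the list.
        public List<Metric> getMetrics() {
            lock.readLock().lock();
            try {
                return new ArrayList<>(metrics);
            } finally {
                lock.readLock().unlock();
            }
        }

        public void report(Metric metric) {
            lock.writeLock().lock();
            try {
                metrics.add(metric);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }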

        );
    } catch (TException tException) {
        LOG.warn("Failed to flush metrics: " + tException.getMessage());
    } catch (Throwable t) {
@prateek (Collaborator)
Is the TException -> Throwable change intentional? If so, what motivated it?

@alexeykudinkin (Contributor, Author)

To prevent uncaught exceptions in the processor thread.

@prateek (Collaborator)

Were there particular uncaught exceptions that triggered this? Wondering why it's required now.

@alexeykudinkin (Contributor, Author)

Because failing the processor thread due to a RuntimeException doesn't make sense.

@prateek (Collaborator)

Hm, I don't agree with that. E.g. why should this catch an OutOfMemoryError?

@alexeykudinkin (Contributor, Author)

Let me start off by saying that I'm not entirely sure where you noticed occasions of talking over each other. I appreciate your in-depth deliberation, but at the same time I would like to point out that most of the questions you raise could be easily addressed on your own.

> This includes catching things you shouldn't; one example of this is an OOM. If the JVM is throwing an OOM, the process is having much bigger issues than a metrics library malfunctioning. As a user, I want/expect the process to crash, not trudge along in a degraded state.

I agree with the premise that loud failure is preferred, however your conclusion is wrong -- an uncaught exception in the thread will not crash the process, it will only crash the reporting thread, meaning the application will be left without metrics.

Your example with OOM is invalid in nature: the library shouldn't be trying to handle issues it cannot handle; that should be the concern of the application itself.

Reiterating the rationale for this change: the producer's thread should be resilient to exceptions and should continue attempting to produce metrics even if each attempt has a low chance of success, as this is the only way for the application to obtain telemetry.
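
For context, a minimal sketch of the pattern being debated here, with assumed names rather than the actual M3Reporter.Processor code: catching Throwable inside the processor loop keeps the reporting thread alive across a failed flush attempt, whereas an uncaught exception would kill only that thread and silently stop metrics.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Illustrative only: a reporting loop that survives failures of individual flushes.
    final class ProcessorLoopSketch implements Runnable {
        private static final Logger LOG = Logger.getLogger(ProcessorLoopSketch.class.getName());

        private volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                try {
                    flushOnce();
                } catch (Throwable t) {
                    // Log and keep looping; the next flush attempt may succeed.
                    LOG.log(Level.WARNING, "Failed to flush metrics", t);
                }
            }
        }

        // Placeholder for draining queued metrics and writing them to the transport.
        private void flushOnce() {
        }

        void shutdown() {
            running = false;
        }
    }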

@prateek (Collaborator)

re: "would like to point out that most of the questions you raise could be easily addressed on your own."
Please call out specific examples.

re: talking over each other -- bad phrasing. I was expressing that I didn't understand our disconnect. I do now, based on your point about the JVM crashing the Thread, not the Process, when the failure occurs. Good callout.

  • Is this practice (logging such errors and continuing) idiomatic in Java?
  • Does the thread always end up in a state in which it can continue executing? I.e., should it create a replacement?

@alexeykudinkin (Contributor, Author)

  1. This practice has nothing to do with Java idioms. This approach allows the processor to keep running even in the face of transient/permanent issues.
  2. I don't really understand your question. Why would it not be able to continue?

@prateek (Collaborator)

  1. Idioms/prior art are a good indication of what standard practice is in the community. I don't like libraries doing things against the norm.
  2. You're catching literally the broadest throwable type the JVM offers; it could catch any manner of things corrupting thread state -- e.g. a StackOverflowError would be caught here. Is the thread able to continue execution after catching that? Is that true for ALL errors?

All in all, I haven't seen a strong case where this helps. It comes down to a style preference: yours is to continue trying, mine is to log and stop. Until I see a real-world case where this helps, I'm against making it the default behaviour; keeping the current behaviour is also better from a backwards-compatibility standpoint. That said, I'm not against the library supporting it if you want it.

So if you want to include this, please make the library offer a choice so that it's injectable by the user, and override it in your service.
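
A hypothetical sketch of that injectable choice (these names are illustrative and are not part of the tally API): the reporter could accept an error-handling policy, defaulting to the existing fail-fast behaviour, so a service that prefers resilience can opt in.

    // Illustrative policy the processor loop would consult in its catch (Throwable t) block:
    // keep looping if the policy returns true, otherwise log and let the thread stop.
    @FunctionalInterface
    interface ProcessorErrorPolicy {
        boolean shouldContinue(Throwable error);

        // Backwards-compatible default: stop the processor thread on any error.
        ProcessorErrorPolicy FAIL_FAST = error -> false;

        // Opt-in alternative: swallow the error and keep producing metrics.
        ProcessorErrorPolicy RESILIENT = error -> true;
    }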

@alexeykudinkin (Contributor, Author)

Spoke offline; a good case was brought up that such resilience could actually obscure problems within the library, potentially leaving it in a slightly degraded state that might be hard to trace back.

@alexeykudinkin (Contributor, Author) left a comment

@prateek this fixes flaky tests

@prateek (Collaborator) commented Jul 20, 2020

> @prateek this fixes flaky tests

Cool, I'm renaming the PR to indicate that.

@prateek prateek changed the title Fixed M3Reporter.Processor Fix M3Reporter.Processor Flaky Tests Jul 20, 2020
@alexeykudinkin alexeykudinkin requested a review from prateek July 20, 2020 23:38
Alexey Kudinkin added 8 commits July 21, 2020 16:36
Increased t/o to 1 minute (to make sure it accommodates for being run on heavily loaded CI instance)
Enabled tests output;
Adding logger to test deps to properly print tests output
@alexeykudinkin (Contributor, Author)

Took a while to find the very last race condition

@alexeykudinkin (Contributor, Author)

@prateek any other comments?

@prateek (Collaborator) left a comment

LGTM

@prateek prateek merged commit 3280651 into uber-java:master Jul 27, 2020
sairamch04 pushed a commit to sairamch04/tally that referenced this pull request Feb 5, 2023
