Fix M3Reporter.Processor Flaky Tests #71
Conversation
adc48ac to 89777f4
prateek left a comment
Mainly looks fine. Left a few comments inline.
One bigger question: I see a lot of improved hygiene and simpler code, but I don't follow how this fixes anything on master which is breaking. What did you observe that was messing up/how does this fix it?
public List<Metric> getMetrics() {
    lock.readLock().lock();
    try {
        return metrics;
Instead of returning a reference to a guarded variable, should you return a copy here?
This is just to silence the linter -- this is test code; the method is only meant to be called at the end of the test.
OK. Please add a WARNING in the method doc letting people know it's broken if used in any other way.
It's not broken; this is a test method within the test package -- I can hardly imagine it being used in any other way.
Simple failure mode: if users access the data during test while they're still emitting/changing metrics, it's racy.
Fair enough, will update
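The copy-under-lock fix agreed on above could look roughly like the following. This is a hypothetical sketch, not the PR's actual code: `MetricSink` and `String` metrics stand in for the real reporter classes, and the `WARNING` doc reflects the reviewer's request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical test sink illustrating the discussion: getMetrics() returns a
// defensive snapshot taken under the read lock, so a caller can never observe
// concurrent mutation of the guarded list.
class MetricSink {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> metrics = new ArrayList<>();

    void add(String metric) {
        lock.writeLock().lock();
        try {
            metrics.add(metric);
        } finally {
            lock.writeLock().unlock();
        }
    }

    /**
     * WARNING: intended to be called only after the test has finished emitting
     * metrics. Returns an immutable snapshot rather than the guarded list itself.
     */
    List<String> getMetrics() {
        lock.readLock().lock();
        try {
            return Collections.unmodifiableList(new ArrayList<>(metrics));
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because the snapshot is a copy, later writes by emitting threads cannot retroactively change what an earlier `getMetrics()` call returned.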
);
} catch (TException tException) {
    LOG.warn("Failed to flush metrics: " + tException.getMessage());
} catch (Throwable t) {
Is the TException -> Throwable change intentional? If so, what motivated it?
To prevent uncaughts in the processor thread
were there particular uncaughts that triggered this? wondering why it's required now?
B/c killing the processor thread over a RuntimeException doesn't make sense.
Hm, I don't agree with that. E.g. why should this catch an OutOfMemoryError?
Let me start off by saying that I'm not entirely sure where you noticed occasions of us talking over each other. I appreciate your in-depth deliberation, but at the same time I would like to point out that most of the questions you raise could easily be addressed on your own.
This includes catching things you shouldn't; one example is an OOM. If the JVM is throwing an OOM, the process has much bigger issues than a metrics library malfunctioning. As a user, I want/expect the process to crash, not trudge along in a degraded state.
Agreed on the premise that loud failure is preferred, but your conclusion is wrong -- an uncaught exception in the thread will not crash the process; it will only kill the reporting thread, meaning the application will be left without metrics.
Your example with OOM is invalid in nature: a library shouldn't try to handle issues it can't handle; that should be the responsibility of the application itself.
Re-iterating the rationale for this change: the producer's thread should be resilient to exceptions and continue attempting to produce metrics, even if each attempt has a low chance of success, as this is the only way for the application to obtain telemetry.
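The point above about the JVM crashing the thread rather than the process can be demonstrated in isolation. This is a minimal standalone sketch (not code from the PR); the swallowing uncaught-exception handler is only there to keep the demo's output quiet and does not change the outcome.

```java
// Demonstrates that an uncaught RuntimeException terminates only the throwing
// thread; the JVM process itself keeps running.
class ThreadCrashDemo {
    static boolean processSurvivesThreadFailure() {
        Thread reporter = new Thread(() -> {
            throw new RuntimeException("simulated reporter failure");
        });
        // Replace the default handler (which prints a stack trace) to keep the
        // demo quiet; the exception is still uncaught and still kills the thread.
        reporter.setUncaughtExceptionHandler((t, e) -> { /* intentionally empty */ });
        reporter.start();
        try {
            reporter.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
        // Reaching this line proves the process is alive even though the
        // reporter thread has died.
        return !reporter.isAlive();
    }
}
```

This is exactly the failure mode being debated: the application keeps running, but silently loses its metrics reporter.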
re: "would like to point out that most of the questions you raise could be easily addressed on your own."
Please call out specific examples.
re: talking over each other -- bad phrasing. I was expressing that I didn't understand our disconnect. I do now, based on your point about the JVM crashing the Thread, not the Process, when the failure occurs. Good callout.
- Is this practice (logging such errors and continuing) idiomatic in Java?
- Does the thread always end up in a state it can continue executing? i.e. should it create a replacement?
- This practice has nothing to do with Java idioms. This approach keeps the processor running even in the face of transient/permanent issues.
- I don't really understand your question. Why would it not be able to continue?
- Idioms/prior art are a good indication of what standard practice is in the community. I don't like libraries doing things against the norm.
- You're catching literally the lowest-level error type the JVM offers; it could catch any manner of things corrupting thread state -- e.g. a StackOverflowError would be caught here. Is the thread able to continue execution after catching that? Is that true for ALL errors?
All in all, I haven't seen a strong case that this helps. It comes down to a style preference: yours is to continue trying, and mine is to log and stop. Until I see a real-world case where this helps, I'm against making it the default behaviour; further, keeping the current behaviour is better from a backwards-compatibility standpoint as well. That said, I'm not against the library supporting it if you want it.
So if you want to include this, please make the library offer a choice so that it's injectable by the user, and override it in your service.
Spoke offline; a good case was brought up that such resilience could actually obscure problems within the library, potentially leaving it in a slightly degraded state that might be hard to trace back.
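For reference, the injectable-handler compromise suggested above could be sketched roughly as follows. This is a hypothetical design sketch, not the PR's code: `ProcessorLoop` and its `onError` callback are invented names. It catches `Exception` (so `RuntimeException` is handled) and delegates the policy to the caller, while `Error` subclasses like `OutOfMemoryError` or `StackOverflowError` are deliberately left to propagate.

```java
import java.util.function.Consumer;

// Hypothetical processor loop: exception-handling policy is injected by the
// user rather than hard-coded. Errors are NOT caught and propagate normally.
class ProcessorLoop {
    private final Consumer<Exception> onError;

    ProcessorLoop(Consumer<Exception> onError) {
        this.onError = onError;
    }

    /** Runs one iteration; returns true if the flush step completed without error. */
    boolean runOnce(Runnable flush) {
        try {
            flush.run();
            return true;
        } catch (Exception e) {
            // User decides: log-and-continue, count, or rethrow to stop the loop.
            onError.accept(e);
            return false;
        }
    }
}
```

A service that wants the resilient behaviour can inject a logging handler, while the library's default handler could simply rethrow, preserving the old log-and-stop semantics.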
alexeykudinkin left a comment
@prateek this fixes flaky tests
Cool, I'm renaming the PR to indicate that.
Increased timeout to 1 minute (to make sure it accommodates being run on a heavily loaded CI instance).
Enabled test output; added a logger to the test deps to properly print test output.
Took a while to find the very last race condition

@prateek any other comments?
Fixing M3Reporter.Processor, preparing it for a multi-processor setup. Previously, even though M3Reporter had been set up to run 10 Processors, it never actually worked properly, and by pure luck it wasn't corrupting its own state (those errors somewhat saved it from that). This change addresses all of those issues.
However, additional tests need to be written alongside fixing the currently existing ones.