
Fix M3Reporter.Processor Flaky Tests #71

Merged

prateek merged 31 commits into uber-java:master from alexeykudinkin:mp-ff-fix-proc on Jul 27, 2020

Conversation

@alexeykudinkin (Contributor)

Fixing M3Reporter.Processor, preparing it for a multi-processor setup.

Previously, even though M3Reporter had been set up to run 10 Processors, it never actually worked properly, and only by pure luck did it avoid corrupting its own state (those very errors somewhat saved it from doing so):

  1. The flushing sequence didn't work: a flush was only triggered within a single processor.
  2. Processors shared the same transport without ANY synchronization, which could result in a corrupted payload being sent over the wire (a sketch of one way to serialize such access follows below).

This change addresses all of those issues.

However, additional tests need to be written alongside fixing the existing ones.
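
For illustration, a minimal sketch of one way to address the second issue, with assumed class and method names rather than the actual tally M3 types: serialize writes to the shared transport through a lock owned by the reporter.

    import java.util.concurrent.locks.ReentrantLock;

    // Illustrative only: guards a transport shared by several processor threads so
    // that payloads from different processors can never interleave on the wire.
    final class SharedTransportWriter {

        // Assumed minimal transport abstraction for the sake of the example.
        interface Transport {
            void write(byte[] payload);
        }

        private final ReentrantLock transportLock = new ReentrantLock();
        private final Transport transport;

        SharedTransportWriter(Transport transport) {
            this.transport = transport;
        }

        // Called concurrently by multiple processor threads; the lock ensures each
        // payload is written as a whole before another processor can write.
        void flush(byte[] payload) {
            transportLock.lock();
            try {
                transport.write(payload);
            } finally {
                transportLock.unlock();
            }
        }
    }

An alternative design is to give every processor its own transport instance, so that no cross-thread synchronization is needed at all.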

@CLAassistant commented May 10, 2020

CLA assistant check
All committers have signed the CLA.

@prateek (Collaborator) left a comment

Mainly looks fine. Left a few comments inline.

One bigger question: I see a lot of improved hygiene and simpler code, but I don't follow how this fixes anything that is breaking on master. What did you observe going wrong, and how does this fix it?

    public List<Metric> getMetrics() {
        lock.readLock().lock();
        try {
            return metrics;
@prateek (Collaborator)

Instead of a reference to the guarded variable, should you return a copy here?

@alexeykudinkin (Contributor, Author) commented Jul 20, 2020

This is just to silence the linter -- this is test code, and the method is only meant to be called at the end of the test.

@prateek (Collaborator)
OK. Please add a WARNING in the method doc letting people know it's broken if used in any other way.

@alexeykudinkin (Contributor, Author)

It's not broken; this is a test method within the test package -- I can hardly imagine it being used in any other way.

@prateek (Collaborator)
Simple failure mode: if users access the data during a test while metrics are still being emitted/changed, it's racy.

@alexeykudinkin (Contributor, Author)
Fair enough, will update
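
A minimal sketch of the defensive-copy variant discussed in this thread (the class, field, and lock names mirror the quoted snippet but are assumptions, not the merged code):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative test-support reporter that records metrics and hands out snapshots.
    final class CapturingStatsReporter {

        // Placeholder for whatever metric type the test harness records.
        interface Metric {}

        private final ReadWriteLock lock = new ReentrantReadWriteLock();
        private final List<Metric> metrics = new ArrayList<>();

        // Returns a snapshot instead of the guarded list itself, so a test that reads
        // metrics while processor threads are still emitting never races on the list.
        public List<Metric> getMetrics() {
            lock.readLock().lock();
            try {
                return new ArrayList<>(metrics);
            } finally {
                lock.readLock().unlock();
            }
        }

        public void report(Metric metric) {
            lock.writeLock().lock();
            try {
                metrics.add(metric);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }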

        );
    } catch (TException tException) {
        LOG.warn("Failed to flush metrics: " + tException.getMessage());
    } catch (Throwable t) {
@prateek (Collaborator)
Is the TException -> Throwable change intentional? If so, what motivated it?

@alexeykudinkin (Contributor, Author)

To prevent uncaught exceptions in the processor thread.

@prateek (Collaborator)

Were there particular uncaught exceptions that triggered this? Wondering why it's required now.

@alexeykudinkin (Contributor, Author)

Because failing the processor thread due to a RuntimeException doesn't make sense.

@prateek (Collaborator)

Hm, I don't agree with that. E.g. why should this catch an OutOfMemoryError?

@alexeykudinkin (Contributor, Author)

Let me start off by saying that I'm not entirely sure where you noticed occasions of talking over each other. I appreciate your in-depth deliberation, but at the same time I would like to point out that most of the questions you raise could be easily addressed on your own.

> This includes catching things you shouldn't; one example of this is an OOM. If the JVM is throwing an OOM, the process is having much bigger issues than a metrics library malfunctioning. As a user, I want/expect the process to crash, not trudge along in a degraded state.

I agree with the premise that loud failure is preferred, however your conclusion is wrong -- an uncaught exception in the thread will not crash the process, it will only crash the reporting thread, meaning the application will be left without metrics.

Your example with OOM is invalid in nature: the library shouldn't be trying to handle issues it cannot handle; that should be the concern of the application itself.

Reiterating the rationale for this change: the producer's thread should be resilient to exceptions and should continue attempting to produce metrics even if each attempt has a low chance of success, as this is the only way for the application to obtain telemetry.
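
For context, a minimal sketch of the pattern being debated here, with assumed names rather than the actual M3Reporter.Processor code: catching Throwable inside the processor loop keeps the reporting thread alive across a failed flush attempt, whereas an uncaught exception would kill only that thread and silently stop metrics.

    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Illustrative only: a reporting loop that survives failures of individual flushes.
    final class ProcessorLoopSketch implements Runnable {
        private static final Logger LOG = Logger.getLogger(ProcessorLoopSketch.class.getName());

        private volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                try {
                    flushOnce();
                } catch (Throwable t) {
                    // Log and keep looping; the next flush attempt may succeed.
                    LOG.log(Level.WARNING, "Failed to flush metrics", t);
                }
            }
        }

        // Placeholder for draining queued metrics and writing them to the transport.
        private void flushOnce() {
        }

        void shutdown() {
            running = false;
        }
    }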

@prateek (Collaborator)

re: "would like to point out that most of the questions you raise could be easily addressed on your own."
Please call out specific examples.

re: talking over each other -- bad phrasing. I was expressing that I didn't understand our disconnect. I do now, based on your point about the JVM crashing the Thread, not the Process, when the failure occurs. Good callout.

  • Is this practice (logging such errors and continuing) idiomatic in Java?
  • Does the thread always end up in a state in which it can continue executing? I.e., should it create a replacement?

@alexeykudinkin (Contributor, Author)

  1. This practice has nothing to do with Java idioms. This approach allows the processor to keep running even in the face of transient/permanent issues.
  2. I don't really understand your question. Why would it not be able to continue?

@prateek (Collaborator)

  1. Idioms/prior art are a good indication of what standard practice is in the community. I don't like libraries doing things against the norm.
  2. You're catching literally the broadest throwable type the JVM offers; it could catch any manner of things corrupting thread state -- e.g. a StackOverflowError would be caught here. Is the thread able to continue execution after catching that? Is that true for ALL errors?

All in all, I haven't seen a strong case where this helps. It comes down to a style preference: yours is to continue trying, mine is to log and stop. Until I see a real-world case where this helps, I'm against making it the default behaviour; keeping the current behaviour is also better from a backwards-compatibility standpoint. That said, I'm not against the library supporting it if you want it.

So if you want to include this, please make the library offer a choice so that it's injectable by the user, and override it in your service.
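
A hypothetical sketch of that injectable choice (these names are illustrative and are not part of the tally API): the reporter could accept an error-handling policy, defaulting to the existing fail-fast behaviour, so a service that prefers resilience can opt in.

    // Illustrative policy the processor loop would consult in its catch (Throwable t) block:
    // keep looping if the policy returns true, otherwise log and let the thread stop.
    @FunctionalInterface
    interface ProcessorErrorPolicy {
        boolean shouldContinue(Throwable error);

        // Backwards-compatible default: stop the processor thread on any error.
        ProcessorErrorPolicy FAIL_FAST = error -> false;

        // Opt-in alternative: swallow the error and keep producing metrics.
        ProcessorErrorPolicy RESILIENT = error -> true;
    }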

@alexeykudinkin (Contributor, Author)

Spoke offline; a good case was brought up that such resilience could actually obscure problems within the library, potentially leaving it in a slightly degraded state that might be hard to trace back.

@alexeykudinkin (Contributor, Author) left a comment

@prateek this fixes flaky tests

@prateek (Collaborator) commented Jul 20, 2020

> @prateek this fixes flaky tests

Cool, I'm renaming the PR to indicate that.

@prateek prateek changed the title Fixed M3Reporter.Processor Fix M3Reporter.Processor Flaky Tests Jul 20, 2020
@alexeykudinkin alexeykudinkin requested a review from prateek July 20, 2020 23:38
Alexey Kudinkin added 8 commits July 21, 2020 16:36
Increased t/o to 1 minute (to make sure it accommodates for being run on heavily loaded CI instance)
Enabled tests output;
Adding logger to test deps to properly print tests output
@alexeykudinkin (Contributor, Author)

Took a while to find the very last race condition

@alexeykudinkin (Contributor, Author)

@prateek any other comments?

@prateek (Collaborator) left a comment

LGTM

@prateek prateek merged commit 3280651 into uber-java:master Jul 27, 2020
sairamch04 pushed a commit to sairamch04/tally that referenced this pull request Feb 5, 2023
