Add a config for monitorScheduler type by jihoonson · Pull Request #10732 · apache/druid

jihoonson · 2021-01-07T02:28:26Z

#10448 modified the MonitorScheduler to use CronScheduler instead of ScheduledExecutorService. This change looks good to me except that I'm not sure how well-tested CronScheduler is. This PR adds the previous ScheduledExecutorService-based MonitorScheduler back, and a new config, druid.monitoring.schedulerClassName, to determine what type of MonitorScheduler to use. This PR also changes the default MonitorScheduler back to BasicMonitorScheduler. Some brave users may want to explore the new ClockDriftSafeMonitorScheduler. The new config is intentionally not documented as we will get rid of it once the new scheduler is proven to be safe. However, it should be called out in the release notes.
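As a rough illustration of what selecting a scheduler by class name could look like, here is a minimal sketch. The `Scheduler` interface, both nested classes, and the `create` helper are hypothetical stand-ins for this example, not Druid's actual classes or wiring; only the idea of a class-name config with a basic default comes from the description above.

```java
class SchedulerSelection
{
  // Hypothetical stand-ins; these are not Druid's real scheduler classes.
  interface Scheduler
  {
    String describe();
  }

  static class BasicScheduler implements Scheduler
  {
    @Override
    public String describe()
    {
      return "basic";
    }
  }

  static class ClockDriftSafeScheduler implements Scheduler
  {
    @Override
    public String describe()
    {
      return "clock-drift-safe";
    }
  }

  // The default mirrors the PR description: the older, basic implementation.
  static final String DEFAULT_CLASS_NAME = BasicScheduler.class.getName();

  // Maps the configured class name to an implementation and rejects unknown names.
  static Scheduler create(String schedulerClassName)
  {
    if (BasicScheduler.class.getName().equals(schedulerClassName)) {
      return new BasicScheduler();
    } else if (ClockDriftSafeScheduler.class.getName().equals(schedulerClassName)) {
      return new ClockDriftSafeScheduler();
    }
    throw new IllegalArgumentException("Unknown monitor scheduler: " + schedulerClassName);
  }
}
```

Failing fast on an unknown name is a design choice in this sketch; it makes a typo in the config visible at startup rather than silently falling back to the default.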

This PR additionally fixes 3 bugs in MonitorScheduler.

- When an exception is thrown in monitor.monitor(), the behaviour changed unexpectedly so that the monitor stops. After this PR, the monitor ignores exceptions and continues working, as it used to.
- There is a race condition between when a scheduledFuture is set in a monitor and when the scheduledFuture is used. This is unlikely to happen in production since the first cronTask is executed only after the emitter period, but it is possible in theory.
- #10448 (Added CronScheduler support as a proof to clock drift while emitting metrics) changed monitoring to use 64 threads, which seems like overkill to me. This PR changes it back to a single thread.
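To illustrate the first bug: with a plain `ScheduledExecutorService`, a periodic task that throws has its subsequent executions suppressed, so the task has to catch exceptions itself. The sketch below shows that pattern under stated assumptions; `SimpleMonitor` and `ignoringExceptions` are hypothetical names for this example and not Druid's Monitor API.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class SafeMonitorTask
{
  // Hypothetical stand-in for a monitor; not Druid's actual Monitor interface.
  interface SimpleMonitor
  {
    void monitor();
  }

  // Wraps a monitor so that an exception in one run does not cancel later runs.
  static Runnable ignoringExceptions(SimpleMonitor monitor, AtomicInteger failures)
  {
    return () -> {
      try {
        monitor.monitor();
      }
      catch (RuntimeException e) {
        // Record and continue; letting this escape would suppress all later runs.
        failures.incrementAndGet();
      }
    };
  }

  // Schedules a monitor that fails on every odd call and returns how often it was invoked.
  static int runDemo(int targetRuns)
  {
    AtomicInteger calls = new AtomicInteger();
    AtomicInteger failures = new AtomicInteger();
    CountDownLatch done = new CountDownLatch(targetRuns);
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    try {
      SimpleMonitor flaky = () -> {
        int n = calls.incrementAndGet();
        done.countDown();
        if (n % 2 == 1) {
          throw new IllegalStateException("simulated failure");
        }
      };
      exec.scheduleAtFixedRate(ignoringExceptions(flaky, failures), 0, 10, TimeUnit.MILLISECONDS);
      done.await(5, TimeUnit.SECONDS);
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    finally {
      exec.shutdownNow();
    }
    return calls.get();
  }
}
```

Without the wrapper, the very first call would throw and the monitor would never run again; with it, the monitor keeps being invoked on schedule.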

An additional change is that a new monitor task will not be scheduled if a previous one is still running. For example, suppose a monitor task takes 3 seconds to complete for some reason. In this case, only one set of metrics, captured when the monitor task started, is emitted during those 3 seconds. I think this is OK because it avoids a growing queue in the scheduler, even though some metrics can be missing.
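The skip-if-still-running behaviour can be sketched roughly as follows. This is an illustration only, assuming each scheduler tick calls a hypothetical `tick` method; the class and method names are made up for this example and do not reflect the actual implementation in this PR.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SkipIfRunningScheduler
{
  private final ExecutorService executor;
  private Future<?> previous = null;

  SkipIfRunningScheduler(ExecutorService executor)
  {
    this.executor = executor;
  }

  // Called on every tick. Returns true if the task was submitted, or false if
  // the previous run has not finished yet and this tick was skipped.
  synchronized boolean tick(Runnable monitorTask)
  {
    if (previous != null && !previous.isDone()) {
      return false;
    }
    previous = executor.submit(monitorTask);
    return true;
  }

  // Simulates a slow monitor: the second tick arrives while the first run is blocked.
  static String demo()
  {
    ExecutorService exec = Executors.newSingleThreadExecutor();
    SkipIfRunningScheduler scheduler = new SkipIfRunningScheduler(exec);
    CountDownLatch release = new CountDownLatch(1);
    Runnable slow = () -> {
      try {
        release.await();
      }
      catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
    try {
      boolean first = scheduler.tick(slow);      // submitted
      boolean second = scheduler.tick(slow);     // skipped, first run still blocked
      release.countDown();
      scheduler.previous.get(5, TimeUnit.SECONDS);
      boolean third = scheduler.tick(() -> { }); // submitted again
      return first + "," + second + "," + third;
    }
    catch (Exception e) {
      throw new RuntimeException(e);
    }
    finally {
      exec.shutdownNow();
    }
  }
}
```

Skipping rather than queueing keeps the backlog bounded at one task, at the cost of the missing metrics described above.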

This PR has:

- been self-reviewed.
  - using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- added documentation for new or modified features or behaviors.
- added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- added or updated version, license, or notice information in licenses.yaml
- added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- added integration tests.
- been tested in a test Druid cluster.

lgtm-com · 2021-01-07T03:37:18Z

This pull request introduces 1 alert when merging 6b791d1 into 48e576a - view on LGTM.com

new alerts:

1 for Dereferenced variable may be null

pjain1 · 2021-01-10T06:20:20Z

 public class DruidMonitorSchedulerConfig extends MonitorSchedulerConfig
 {
+  @JsonProperty
+  private String schedulerClassName = ClockDriftSafeMonitorScheduler.class.getName();


This changes the default to this monitor scheduler type. Is this OK/tested?

It was changed in #10448, not in this PR. This PR just makes it configurable because I'm not sure how stable it is. As noted in #10448 (comment), CronScheduler seems to have decent test coverage and worked well in my testing.

I guess it will get tested in the next RC then.

Yes, I have done some testing before and will do more.

My 2¢: the best plan is to default to the old one, and then in the future (after some people have enabled the new one in production) we should switch to the new one, and remove the old one and remove the config entirely.

Rationale:

The new scheduler is designed to eliminate potential clock drift for monitors. This reward is real but has a pretty small impact. I don't expect anything bad will happen if the schedule drifts a bit. The main risk of the new scheduler, I suppose, is that there's some case where it goes haywire, and either locks up completely or fires much more often than it should. I'm not sure how likely this is, but it's (a) hard to test for and (b) quite bad if it happens.

So, because the potential reward has a small impact, and the potential risk has a large impact, I think it's best to default to the old scheduler for another release or so. Just until such time as people have been able to do long-running tests in production and have found that there are no issues.

At any rate, it's good that this is undocumented, since it's an inside-baseball sort of config that we would only want to exist for a few releases.

By the way, if anyone has been running the patch in production for a while already, now would be a good time to speak up. If we have already built up a good amount of confidence then I think it makes sense to default to the new one.

I've also thought about this a bunch and have changed my opinion on whether or not we should change the scheduler to a new dependency by default a few times.

While changing to use the CronScheduler might fix a bug, it isn't clear whether any users have run into this in the field. I thought about documenting why a user would want to change the scheduler to CronScheduler instead of the older implementation, and I couldn't think of a good user facing reason to do so. So if we set the default to the old implementation, I don't think anyone would test it in production, so it would continue to live as dead code, and we'll have the same dilemma in the next release or 2 when we ask whether or not this has been run in production.

Setting the default to the older implementation reduces the impact of any bug that might show up in long running tests (even though this library was specifically built to fix issues found with long running processes). The drawback here is finding a reason for some users to try this in production so that we can sunset the feature flag in a release or 2.

Writing out this comment, I now think the more cautious approach - keeping the default the same - is better as it's hard to articulate the benefit for switching the scheduling and taking on the risk associated with changing the older behavior.

> So, because the potential reward has a small impact, and the potential risk has a large impact, I think it's best to default to the old scheduler for another release or so. Just until such time as people have been able to do long-running tests in production and have found that there are no issues.

This makes sense to me. I think we can do more extensive testing by ourselves instead of rushing to change the default.

Changed the default back to BasicMonitorScheduler.

pjain1 · 2021-01-10T06:21:52Z

+          @Override
+          public void run(long scheduledRunTimeMillis)
+          {
+            waitForScheduleFutureToBeSet();


Why not just use a CountDownLatch that counts down after scheduleFutureReference is set, instead of continuously checking in a loop?

We can use it, but it doesn't seem to matter much to me since this loop is not supposed to run at all in production.
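For reference, the CountDownLatch alternative suggested above could look roughly like the sketch below. It is not what the PR does (the PR keeps the loop); the class name, `demo` method, and variables are hypothetical, and a plain `ScheduledExecutorService` is used here in place of CronScheduler to keep the example self-contained.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

class FutureHandoff
{
  // Returns true if the task observed its own future after waiting on the latch.
  static boolean demo()
  {
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();
    AtomicReference<Future<?>> futureRef = new AtomicReference<>();
    CountDownLatch futureSet = new CountDownLatch(1);
    CountDownLatch ran = new CountDownLatch(1);
    AtomicBoolean sawFuture = new AtomicBoolean(false);
    try {
      // The task may start before the scheduling thread has stored the future.
      Future<?> future = exec.schedule(
          () -> {
            try {
              futureSet.await(); // block until the future has been published
              sawFuture.set(futureRef.get() != null);
            }
            catch (InterruptedException e) {
              Thread.currentThread().interrupt();
            }
            finally {
              ran.countDown();
            }
          },
          0,
          TimeUnit.MILLISECONDS
      );
      futureRef.set(future);
      futureSet.countDown(); // publish: the task can now safely use futureRef
      ran.await(5, TimeUnit.SECONDS);
      return sawFuture.get();
    }
    catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    finally {
      exec.shutdownNow();
    }
  }
}
```

The latch blocks the task instead of spinning, which matters little if the wait is never hit in production, as noted in the reply.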

clintropolis · 2021-01-12T04:13:48Z

+        TimeUnit.MILLISECONDS,
+        new CronTask()
+        {
+          private Future<?> scheduleFuture = null;


nit: I wonder if these can have better names: scheduledFuture, scheduleFuture, and scheduleFutureReference are a bit too close to each other and sort of confusing at first glance. Should the inner one perhaps be called cancellationFuture or something to distinguish it from the external one?

Ah, forgot to clean them up before commit. Thanks 👍

jihoonson · 2021-01-14T01:20:40Z

@clintropolis @pjain1 @suneet-s @gianm thanks for the review.

* Add a config for monitorScheduler type
* check interrupted
* null check
* do not schedule monitor if the previous one is still running
* checkstyle
* clean up names
* change default back to basic
* fix test

jihoonson added 2 commits January 6, 2021 18:02

Add a config for monitorScheduler type

627d08e

check interrupted

6b791d1

jihoonson added the Bug, Release Notes, and Area - Metrics/Event Emitting labels Jan 7, 2021

null check

b26f767

jihoonson mentioned this pull request Jan 7, 2021

Added CronScheduler support as a proof to clock drift while emitting metrics #10448

Merged

4 tasks

do not schedule monitor if the previous one is still running

d7e1855

jihoonson added this to the 0.21.0 milestone Jan 8, 2021

jihoonson added 2 commits January 7, 2021 21:01

checkstyle

fe695f9

Merge branch 'master' of github.com:apache/druid into configurable-mo…

fa6539a

…nitor-scheduler

pjain1 reviewed Jan 10, 2021

clintropolis approved these changes Jan 12, 2021

jihoonson added 3 commits January 12, 2021 00:44

clean up names

6abbd80

change default back to basic

abbf4d6

fix test

0bc6898

jihoonson mentioned this pull request Jan 13, 2021

[Draft] 0.21.0 Release Notes #10752

Closed

suneet-s approved these changes Jan 14, 2021

jihoonson merged commit b3325c1 into apache:master Jan 14, 2021

5 participants
