Skip to content

Add a config for monitorScheduler type#10732

Merged
jihoonson merged 9 commits intoapache:masterfrom
jihoonson:configurable-monitor-scheduler
Jan 14, 2021
Merged

Add a config for monitorScheduler type#10732
jihoonson merged 9 commits intoapache:masterfrom
jihoonson:configurable-monitor-scheduler

Conversation

@jihoonson
Copy link
Copy Markdown
Contributor

@jihoonson jihoonson commented Jan 7, 2021

#10448 modified the MonitorScheduler to use CronScheduler instead of ScheduledExecutorService. This change looks good to me except that I'm not sure how well-tested CronScheduler is. This PR adds the previous ScheduledExecutorService-based MonitorScheduler back, and a new config, druid.monitoring.schedulerClassName, to determine what type of MonitorScheduler to use. This PR also changes the default MonitorScheduler back to BasicMonitorScheduler. Some brave users may want to explore the new ClockDriftSafeMonitorScheduler. The new config is intentionally not documented as we will get rid of it once the new scheduler is proven to be safe. However, it should be called out in the release notes.

This PR additionally fixes 3 bugs in MonitorScheduler.

An additional change is that a new monitor task will not be scheduled if there is a previous one still running. For example, there is a monitor task which took 3 seconds to complete for some reason. In this case, there will be only one set of metrics emitted during these 3 seconds which are captured when a monitor task started. I think this is OK because we can avoid the growing queue in the scheduler even though some metrics can be missing.


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@lgtm-com
Copy link
Copy Markdown

lgtm-com Bot commented Jan 7, 2021

This pull request introduces 1 alert when merging 6b791d1 into 48e576a - view on LGTM.com

new alerts:

  • 1 for Dereferenced variable may be null

@jihoonson jihoonson added this to the 0.21.0 milestone Jan 8, 2021
public class DruidMonitorSchedulerConfig extends MonitorSchedulerConfig
{
@JsonProperty
private String schedulerClassName = ClockDriftSafeMonitorScheduler.class.getName();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this changes the default to this monitor schedule type, is this ok/tested ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was changed in #10448, not in this PR. This PR is just to make it configurable because I'm not sure how stable it is. As noted in #10448 (comment), CronScheduler seems to have a not-bad test coverage and worked well in my testing.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it will get tested in the next RC then.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have done some testing before and will do more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2¢: the best plan is to default to the old one, and then in the future (after some people have enabled the new one in production) we should switch to the new one, and remove the old one and remove the config entirely.

Rationale:

The new scheduler is designed to eliminate potential clock drift for monitors. This reward is real but is pretty small impact. I don't expect anything bad will happen if the schedule drifts a bit. The main risk of the new scheduler, I suppose, is that there's some case where it goes haywire, and either locks up completely or fires much more often than it should. I'm not sure how likely this is, but it's (a) hard to test for, (b) quite bad if it happens.

So, because the potential reward has a small impact, and the potential risk has a large impact, I think it's best to default to the old scheduler for another release or so. Just until such time as people have been able to do long-running tests in production and have found that there are no issues.

At any rate, it's good that this is undocumented, since it's an inside-baseball sort of config that we would only want to exist for a few releases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, if anyone has been running the patch in production for a while already, now would be a good time to speak up. If we have already built up a good amount of confidence then I think it makes sense to default to the new one.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also thought about this a bunch and have changed my opinion on whether or not we should change the scheduler to a new dependency by default a few times.

While changing to use the CronScheduler might fix a bug, it isn't clear whether any users have run into this in the field. I thought about documenting why a user would want to change the scheduler to CronScheduler instead of the older implementation, and I couldn't think of a good user facing reason to do so. So if we set the default to the old implementation, I don't think anyone would test it in production, so it would continue to live as dead code, and we'll have the same dilemma in the next release or 2 when we ask whether or not this has been run in production.

Setting the default to the older implementation reduces the impact of any bug that might show up in long running tests (even though this library was specifically built to fix issues found with long running processes). The drawback here is finding a reason for some users to try this in production so that we can sunset the feature flag in a release or 2.

Writing out this comment, I now think the more cautious approach - keeping the default the same - is better as it's hard to articulate the benefit for switching the scheduling and taking on the risk associated with changing the older behavior.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, because the potential reward has a small impact, and the potential risk has a large impact, I think it's best to default to the old scheduler for another release or so. Just until such time as people have been able to do long-running tests in production and have found that there are no issues.

This makes sense to me. I think we can do more extensive testing by ourselves instead of rushing to change the default.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the default back to BasicMonitorScheduler.

@Override
public void run(long scheduledRunTimeMillis)
{
waitForScheduleFutureToBeSet();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just use a CountDownLatch instead of continuously checking in continuous loop that counts down after scheduleFutureReference is set

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use it, but it seems not matter much to me since this loop is not supposed to run at all in production.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

TimeUnit.MILLISECONDS,
new CronTask()
{
private Future<?> scheduleFuture = null;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I wonder if these can have better names: scheduledFuture, scheduleFuture, and scheduleFutureReference is a bit too close to each other and is sort of confusing at first glance. Should the inner one perhaps be called cancellationFuture or something to distinguish it from the external one?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, forgot to clean them up before commit. Thanks 👍

@jihoonson
Copy link
Copy Markdown
Contributor Author

@clintropolis @pjain1 @suneet-s @gianm thanks for the review.

@jihoonson jihoonson merged commit b3325c1 into apache:master Jan 14, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* Add a config for monitorScheduler type

* check interrupted

* null check

* do not schedule monitor if the previous one is still running

* checkstyle

* clean up names

* change default back to basic

* fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants