Add a config for monitorScheduler type#10732
Conversation
|
This pull request introduces 1 alert when merging 6b791d1 into 48e576a - view on LGTM.com new alerts:
|
| public class DruidMonitorSchedulerConfig extends MonitorSchedulerConfig | ||
| { | ||
| @JsonProperty | ||
| private String schedulerClassName = ClockDriftSafeMonitorScheduler.class.getName(); |
There was a problem hiding this comment.
this changes the default to this monitor schedule type, is this ok/tested ?
There was a problem hiding this comment.
It was changed in #10448, not in this PR. This PR is just to make it configurable because I'm not sure how stable it is. As noted in #10448 (comment), CronScheduler seems to have a not-bad test coverage and worked well in my testing.
There was a problem hiding this comment.
I guess it will get tested in the next RC then.
There was a problem hiding this comment.
Yes, I have done some testing before and will do more.
There was a problem hiding this comment.
My 2¢: the best plan is to default to the old one, and then in the future (after some people have enabled the new one in production) we should switch to the new one, and remove the old one and remove the config entirely.
Rationale:
The new scheduler is designed to eliminate potential clock drift for monitors. This reward is real but is pretty small impact. I don't expect anything bad will happen if the schedule drifts a bit. The main risk of the new scheduler, I suppose, is that there's some case where it goes haywire, and either locks up completely or fires much more often than it should. I'm not sure how likely this is, but it's (a) hard to test for, (b) quite bad if it happens.
So, because the potential reward has a small impact, and the potential risk has a large impact, I think it's best to default to the old scheduler for another release or so. Just until such time as people have been able to do long-running tests in production and have found that there are no issues.
At any rate, it's good that this is undocumented, since it's an inside-baseball sort of config that we would only want to exist for a few releases.
There was a problem hiding this comment.
By the way, if anyone has been running the patch in production for a while already, now would be a good time to speak up. If we have already built up a good amount of confidence then I think it makes sense to default to the new one.
There was a problem hiding this comment.
I've also thought about this a bunch and have changed my opinion on whether or not we should change the scheduler to a new dependency by default a few times.
While changing to use the CronScheduler might fix a bug, it isn't clear whether any users have run into this in the field. I thought about documenting why a user would want to change the scheduler to CronScheduler instead of the older implementation, and I couldn't think of a good user facing reason to do so. So if we set the default to the old implementation, I don't think anyone would test it in production, so it would continue to live as dead code, and we'll have the same dilemma in the next release or 2 when we ask whether or not this has been run in production.
Setting the default to the older implementation reduces the impact of any bug that might show up in long running tests (even though this library was specifically built to fix issues found with long running processes). The drawback here is finding a reason for some users to try this in production so that we can sunset the feature flag in a release or 2.
Writing out this comment, I now think the more cautious approach - keeping the default the same - is better as it's hard to articulate the benefit for switching the scheduling and taking on the risk associated with changing the older behavior.
There was a problem hiding this comment.
So, because the potential reward has a small impact, and the potential risk has a large impact, I think it's best to default to the old scheduler for another release or so. Just until such time as people have been able to do long-running tests in production and have found that there are no issues.
This makes sense to me. I think we can do more extensive testing by ourselves instead of rushing to change the default.
There was a problem hiding this comment.
Changed the default back to BasicMonitorScheduler.
| @Override | ||
| public void run(long scheduledRunTimeMillis) | ||
| { | ||
| waitForScheduleFutureToBeSet(); |
There was a problem hiding this comment.
why not just use a CountDownLatch instead of continuously checking in continuous loop that counts down after scheduleFutureReference is set
There was a problem hiding this comment.
We can use it, but it seems not matter much to me since this loop is not supposed to run at all in production.
| TimeUnit.MILLISECONDS, | ||
| new CronTask() | ||
| { | ||
| private Future<?> scheduleFuture = null; |
There was a problem hiding this comment.
nit: I wonder if these can have better names: scheduledFuture, scheduleFuture, and scheduleFutureReference is a bit too close to each other and is sort of confusing at first glance. Should the inner one perhaps be called cancellationFuture or something to distinguish it from the external one?
There was a problem hiding this comment.
Ah, forgot to clean them up before commit. Thanks 👍
|
@clintropolis @pjain1 @suneet-s @gianm thanks for the review. |
* Add a config for monitorScheduler type * check interrupted * null check * do not schedule monitor if the previous one is still running * checkstyle * clean up names * change default back to basic * fix test
#10448 modified the
MonitorSchedulerto useCronSchedulerinstead ofScheduledExecutorService. This change looks good to me except that I'm not sure how well-testedCronScheduleris. This PR adds the previousScheduledExecutorService-basedMonitorSchedulerback, and a new config,druid.monitoring.schedulerClassName, to determine what type ofMonitorSchedulerto use. This PR also changes the defaultMonitorSchedulerback toBasicMonitorScheduler. Some brave users may want to explore the newClockDriftSafeMonitorScheduler. The new config is intentionally not documented as we will get rid of it once the new scheduler is proven to be safe. However, it should be called out in the release notes.This PR additionally fixes 3 bugs in
MonitorScheduler.monitor.monitor(), the behaviour has changed unexpectedly to stop the monitor. The monitor will ignore exceptions and continue working after this PR as it used to do.An additional change is that a new monitor task will not be scheduled if there is a previous one still running. For example, there is a monitor task which took 3 seconds to complete for some reason. In this case, there will be only one set of metrics emitted during these 3 seconds which are captured when a monitor task started. I think this is OK because we can avoid the growing queue in the scheduler even though some metrics can be missing.
This PR has: