Add pending task based resource management autoscaling strategy by nishantmonu51 · Pull Request #2086 · apache/druid

nishantmonu51 · 2015-12-14T09:22:00Z

Issues with the current autoscaling:

When many tasks start at once, SimpleResourceManagementStrategy can take too long to provision new nodes, because it provisions one node at a time and waits for that node to come online before starting another.
SimpleResourceManagementStrategy maintains a targetWorkerCount based on the number of currently running tasks. During an indexing service upgrade, an entire copy of the RT cluster is made. This copy remains mostly idle for the first hour (or few hours) while tasks are slowly launched on it. Upgrades therefore become very expensive, which causes reluctance to do anything risky with real-time nodes.

This PR adds PendingTaskBasedAutoscalingStrategy as an attempt to resolve the two issues above.
PendingTaskBasedAutoscalingStrategy takes into account the state of pending tasks, the available capacity on existing nodes, and the task assignment strategy to determine how many nodes to scale to.
During an upgrade, only minWorkerNodes nodes are created when the service is updated (instead of duplicating the complete cluster); after that, nodes are added as new tasks are added to the queue.
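
To make the idea concrete, here is a minimal, self-contained sketch of how a pending-task based count could be computed. It is not the code in this PR: the class `PendingTaskScalingSketch`, the `Worker` type, the `workersToProvision` method, and all parameter names are hypothetical, and a simple first-fit placement stands in for the real task assignment strategy.

```java
import java.util.ArrayList;
import java.util.List;

public class PendingTaskScalingSketch {

    /** Hypothetical stand-in for a worker: total task slots and slots in use. */
    static final class Worker {
        final int capacity;
        int used;

        Worker(int capacity, int used) {
            this.capacity = capacity;
            this.used = used;
        }

        boolean hasFreeSlot() {
            return used < capacity;
        }
    }

    /**
     * Simulates placing every pending task on the existing workers (first fit,
     * an assumed stand-in for the assignment strategy). Each task that does not
     * fit opens one more simulated worker. The result is clamped to
     * [minWorkers, maxWorkers] and returned as the number of workers to add.
     */
    static int workersToProvision(
        List<Worker> current,
        int pendingTasks,
        int capacityPerNewWorker,
        int minWorkers,
        int maxWorkers
    ) {
        if (capacityPerNewWorker < 1) {
            throw new IllegalArgumentException("capacityPerNewWorker must be >= 1");
        }

        // Work on a copy so the simulation never mutates the real worker state.
        List<Worker> simulated = new ArrayList<>();
        for (Worker w : current) {
            simulated.add(new Worker(w.capacity, w.used));
        }

        int extraWorkers = 0;
        for (int i = 0; i < pendingTasks; i++) {
            Worker target = null;
            for (Worker w : simulated) {
                if (w.hasFreeSlot()) {
                    target = w;
                    break;
                }
            }
            if (target == null) {
                // No existing capacity left: this task needs a new worker.
                target = new Worker(capacityPerNewWorker, 0);
                simulated.add(target);
                extraWorkers++;
            }
            target.used++;
        }

        int desired = Math.max(minWorkers, current.size() + extraWorkers);
        desired = Math.min(maxWorkers, desired);
        return Math.max(0, desired - current.size());
    }
}
```

In this sketch an empty cluster with no pending tasks asks only for `minWorkers` nodes, which mirrors the upgrade behaviour described above, and a burst of pending tasks yields several nodes in one pass instead of one node at a time.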

The default resource management strategy is still the old one; it can be replaced by the new strategy once the new one has been well tested in production environments.

xvrl · 2016-01-07T05:57:39Z

any reason we need a copy here?

The interface for WorkerSelectStrategy explicitly requires an ImmutableMap. I think this is done to prevent races caused by the addition or removal of workers.

fjy · 2016-01-07T06:00:20Z

@nishantmonu51 can you rebase from master to fix the transient failures?

nishantmonu51 · 2016-01-07T14:30:01Z

rebased.

drcrallen · 2016-01-13T17:12:10Z

Suggest logging the terminal condition with log.debug.

fjy · 2016-02-03T23:07:45Z

@nishantmonu51 @xvrl can we move this to 0.9.1, as we are hard pressed to find reviewers?

nishantmonu51 · 2016-02-16T15:53:14Z

@xvrl @drcrallen: handled the review comments, please have a look again.

drcrallen · 2016-02-22T18:24:15Z

+            @Override
+            public void run()
+            {
+              doProvision(runner);


Sanity note: errors here are caught by the ScheduledExecutors implementation.

fjy · 2016-03-31T00:53:26Z

How much of this is copied from the previous logic?

Most of the common part is abstracted out into the abstract class.

nishantmonu51 · 2016-04-18T16:42:33Z

@xvrl I have tested this in our dogfood cluster and it seems to be working fine.

fjy · 2016-04-26T23:59:37Z

👍


drcrallen · 2016-04-28T22:28:26Z

+    );
+
+    if (want > 0 && currValidWorkers >= maxWorkerCount) {
+      log.warn("Unable to provision more workers. Current workerCount[%d] maximum workerCount[%d].");


Missing parameters
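
For illustration, a minimal sketch of the message with its two arguments supplied. It uses plain `String.format` rather than the project's logger, on the assumption that `log.warn` takes format arguments in the same order; the class and method names are hypothetical.

```java
public class ProvisionWarningSketch {

    /**
     * Builds the warning text with both %d placeholders filled in.
     * Calling String.format with this template and no arguments would throw
     * java.util.MissingFormatArgumentException.
     */
    static String provisionWarning(int currValidWorkers, int maxWorkerCount) {
        return String.format(
            "Unable to provision more workers. Current workerCount[%d] maximum workerCount[%d].",
            currValidWorkers,
            maxWorkerCount
        );
    }
}
```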

nishantmonu51 force-pushed the better-autoscaler branch from 4c92175 to a3ef37c on December 15, 2015 14:42

xvrl reviewed Jan 7, 2016

nishantmonu51 force-pushed the better-autoscaler branch from a3ef37c to 1a3fff3 on January 7, 2016 14:29

drcrallen reviewed Jan 13, 2016

nishantmonu51 force-pushed the better-autoscaler branch from 1a3fff3 to 790a68a on January 20, 2016 15:14

nishantmonu51 force-pushed the better-autoscaler branch 2 times, most recently from 75d0660 to a8d6c9f on January 29, 2016 08:32

xvrl added this to the 0.9.0 milestone Feb 2, 2016

fjy modified the milestones: 0.9.1, 0.9.0 Feb 4, 2016

fjy added the Improvement label Feb 6, 2016

nishantmonu51 force-pushed the better-autoscaler branch from a8d6c9f to e042739 on February 16, 2016 15:50

drcrallen reviewed Feb 22, 2016

fjy reviewed Mar 31, 2016

nishantmonu51 force-pushed the better-autoscaler branch 11 times, most recently from 19c5ab5 to b42e229 on April 12, 2016 14:07

add pending task based resource management strategy

84b9452

review comments review comments review comments fix compilation fix compilation fix ingestion fix guide injection

nishantmonu51 force-pushed the better-autoscaler branch from b42e229 to 84b9452 on April 27, 2016 16:10

xvrl merged commit c29cb7d into apache:master Apr 27, 2016

xvrl deleted the better-autoscaler branch April 27, 2016 17:40

drcrallen reviewed Apr 28, 2016

clambertus unassigned fjy Jul 6, 2018
