Add pending task based resource management autoscaling strategy #2086
xvrl merged 1 commit into apache:master
4c92175 to a3ef37c
The interface for WorkerSelectStrategy explicitly needs ImmutableMap. I think it's done in order to prevent any races due to addition or removal of workers.
@nishantmonu51 can you rebase from master to fix the transient failures?
a3ef37c to 1a3fff3
Rebased.
Suggest log.debug logging terminal condition.
1a3fff3 to 790a68a
75d0660 to a8d6c9f
@nishantmonu51 @xvrl can we move this to 0.9.1, as we are hard pressed to find reviewers?
a8d6c9f to e042739
@xvrl @drcrallen: handled the review comments, please have a look again.
@Override
public void run()
{
  doProvision(runner);
Sanity note: errors here are caught by the ScheduledExecutors implementation.
How much of this is copied from the previous logic?
Most of the common part is abstracted out in the abstract class.
19c5ab5 to b42e229
@xvrl I have tested this in our dogfood cluster and it seems to be working fine.
👍
review comments review comments review comments fix compilation fix compilation fix ingestion fix guide injection
b42e229 to 84b9452
);

if (want > 0 && currValidWorkers >= maxWorkerCount) {
  log.warn("Unable to provision more workers. Current workerCount[%d] maximum workerCount[%d].");
Issues with current autoscaling:
This PR adds PendingTaskBasedAutoscalingStrategy as an attempt to resolve the two issues above.
PendingTaskBasedAutoscalingStrategy takes into account the state of pending tasks, the available capacity on existing nodes, and the task assignment strategy to determine how many nodes to scale to.
During an upgrade, only minWorkerNodes will be created when the service is updated (instead of duplicating the complete cluster); after that, nodes are added as new tasks are added to the queue.
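The sizing idea described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the class `PendingTaskScalingSketch`, the method `workersToProvision`, and all parameter names are hypothetical, every task is assumed to occupy exactly one slot, and the task assignment strategy is reduced to "fill any free slot on an existing worker first".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from the PR itself.
public class PendingTaskScalingSketch {

    /**
     * Estimates how many new workers to request.
     *
     * Assumptions (not from the PR): each pending task needs one slot, and
     * pending tasks are placed on free slots of existing workers before any
     * new worker is considered.
     */
    public static int workersToProvision(
            int pendingTasks,
            List<Integer> freeSlotsPerWorker, // one entry per existing worker
            int slotsPerNewWorker,
            int minWorkerCount,
            int maxWorkerCount) {
        if (slotsPerNewWorker <= 0) {
            throw new IllegalArgumentException("slotsPerNewWorker must be positive");
        }

        // Capacity that is already available on the existing workers.
        int freeSlots = 0;
        for (int slots : freeSlotsPerWorker) {
            freeSlots += Math.max(0, slots);
        }

        // Tasks that cannot be placed on existing capacity.
        int unplaced = Math.max(0, pendingTasks - freeSlots);

        // Workers needed for the unplaced tasks (ceiling division).
        int want = (unplaced + slotsPerNewWorker - 1) / slotsPerNewWorker;

        // Always keep at least the configured minimum number of workers.
        int currentWorkers = freeSlotsPerWorker.size();
        want = Math.max(want, minWorkerCount - currentWorkers);

        // Never exceed the configured maximum number of workers.
        int headroom = Math.max(0, maxWorkerCount - currentWorkers);
        return Math.max(0, Math.min(want, headroom));
    }

    public static void main(String[] args) {
        // 7 pending tasks, 3 free slots on existing workers, 2 slots per new worker.
        System.out.println(workersToProvision(7, Arrays.asList(1, 0, 2), 2, 0, 10));
    }
}
```

With an empty cluster and no pending tasks, this sketch asks only for the minimum number of workers, which mirrors the upgrade behaviour described above; further workers are requested only once the pending queue outgrows the free capacity.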
The default resource management strategy is still the old one, which can be replaced by the new strategy once it has been tested well in production environments.