Skip to content

Add pending task based resource management autoscaling strategy#2086

Merged
xvrl merged 1 commit intoapache:masterfrom
metamx:better-autoscaler
Apr 27, 2016
Merged

Add pending task based resource management autoscaling strategy#2086
xvrl merged 1 commit intoapache:masterfrom
metamx:better-autoscaler

Conversation

@nishantmonu51
Copy link
Copy Markdown
Member

Issues with current autoscaling :

  1. when lots of tasks are getting started, SimpleResourceManagementStrategy can wait too long to provision new nodes as it provisions 1 node at a time and then waits for that node to come online before starting another.
  2. SimpleResourceManagementStrategy maintains a targetWorkerCount based on the number of current running tasks. During an indexing service upgrade an entire copy of the RT cluster is made. This copy remains mostly idle for the first hour (or few hours) while tasks slowly get launched on it. This causes reluctance to do anything risky with real time nodes as upgrades become very expensive

This PR adds PendingTaskBasedAutoscalingStrategy as an attempt to resolve above two issues.
PendingTaskBasedAutoscalingStrategy takes into account the state of pending tasks, available capacity on existing nodes and the task assignment strategy to determine how many nodes to scale to.
During an upgrade, only minWorkerNodes (instead of duplicating complete cluster) will be created as the service is updated and after that nodes are added as new tasks are added to the queue.

The default resource management strategy is still be the old one which can be replaced by the new strategy once it gets tested well in production environments.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason we need a copy here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the interface for WorkerSelectStrategy explicitly needs ImmutableMap.I think its done in order to prevent any races due to addition or removal of workers.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Jan 7, 2016

@nishantmonu51 can you rebase from master to fix the transient failures?

@nishantmonu51
Copy link
Copy Markdown
Member Author

rebased.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest log.debug logging terminal condition.

@nishantmonu51 nishantmonu51 force-pushed the better-autoscaler branch 2 times, most recently from 75d0660 to a8d6c9f Compare January 29, 2016 08:32
@xvrl xvrl added this to the 0.9.0 milestone Feb 2, 2016
@fjy
Copy link
Copy Markdown
Contributor

fjy commented Feb 3, 2016

@nishantmonu51 @xvrl can we move this to 0.9.1 as we are hard pressed to find reviewers

@fjy fjy modified the milestones: 0.9.1, 0.9.0 Feb 4, 2016
@fjy fjy added the Improvement label Feb 6, 2016
@nishantmonu51
Copy link
Copy Markdown
Member Author

@xvrl @drcrallen: handled the review comments, please have a look again.

@Override
public void run()
{
doProvision(runner);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity note: errors here are caught by ScheduledExecutors implementaiton

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much of this is copied from teh previous logic?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the common part is abstracted out in the abstract class.

@nishantmonu51 nishantmonu51 force-pushed the better-autoscaler branch 11 times, most recently from 19c5ab5 to b42e229 Compare April 12, 2016 14:07
@nishantmonu51
Copy link
Copy Markdown
Member Author

@xvrl I have tested this in our dogfood cluster and it seems to be working fine.

@fjy
Copy link
Copy Markdown
Contributor

fjy commented Apr 26, 2016

👍

review comments

review comments

review comments

fix compilation

fix compilation

fix ingestion

fix guide injection
@xvrl xvrl merged commit c29cb7d into apache:master Apr 27, 2016
@xvrl xvrl deleted the better-autoscaler branch April 27, 2016 17:40
);

if (want > 0 && currValidWorkers >= maxWorkerCount) {
log.warn("Unable to provision more workers. Current workerCount[%d] maximum workerCount[%d].");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing parameters

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants