@PawasChhokra PawasChhokra commented Nov 20, 2020

Feature: This feature makes all standby container requests rack aware, so that each active container and its corresponding standby containers are always placed on different racks. This reduces application downtime during rack failures.

One requirement of this feature is that the value of job.standbytasks.replication.factor must be at most 2 for the rack awareness functionality to be honored.

Changes: This PR uses the FaultDomainManager interface for Yarn to request rack-aware nodes when making standby container requests.
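To illustrate the placement rule described above, here is a minimal Java sketch of picking a standby host on a different rack than the active container. It is not the FaultDomainManager code from this PR: the class `RackAwareStandbySketch`, the method `pickStandbyHost`, and the in-memory host-to-rack map are all hypothetical, and the real implementation obtains rack information from Yarn.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: hypothetical names, not the Samza FaultDomainManager API.
public class RackAwareStandbySketch {

  // Hypothetical host -> rack mapping; the PR gets this information from Yarn.
  private final Map<String, String> rackByHost;

  public RackAwareStandbySketch(Map<String, String> rackByHost) {
    this.rackByHost = new LinkedHashMap<>(rackByHost);
  }

  /** Returns the first candidate host whose rack differs from the active host's rack. */
  public Optional<String> pickStandbyHost(String activeHost, List<String> candidateHosts) {
    String activeRack = rackByHost.get(activeHost);
    if (activeRack == null) {
      // Rack of the active host is unknown, so no rack-aware choice is possible.
      return Optional.empty();
    }
    return candidateHosts.stream()
        .filter(rackByHost::containsKey)                            // skip hosts with unknown racks
        .filter(host -> !activeRack.equals(rackByHost.get(host)))   // require a different rack
        .findFirst();
  }

  public static void main(String[] args) {
    Map<String, String> racks = new LinkedHashMap<>();
    racks.put("host-a", "rack-1");
    racks.put("host-b", "rack-1");
    racks.put("host-c", "rack-2");

    RackAwareStandbySketch sketch = new RackAwareStandbySketch(racks);
    // host-b shares rack-1 with the active host, so host-c is chosen.
    System.out.println(sketch.pickStandbyHost("host-a", Arrays.asList("host-b", "host-c")));
  }
}
```

The sketch returns an empty result when no candidate sits on another rack; how the actual allocator behaves in that case is not stated in this description.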

API Changes: None

Tests: Added

Upgrade instructions: TBD

Usage Instructions: For a job with host affinity and standby containers, set the config cluster-manager.fault-domain-aware.standby.enabled to true to enable this feature.
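As a rough illustration of how these settings combine, the Java sketch below builds a config map and checks the preconditions stated in this description. The two keys cluster-manager.fault-domain-aware.standby.enabled and job.standbytasks.replication.factor come from the text above; the host affinity key job.host-affinity.enabled, the fallback values, and the helper `rackAwarenessHonored` are assumptions made for the example and are not part of this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical precondition check, not code from this PR.
public class RackAwareConfigSketch {

  static final String FAULT_DOMAIN_AWARE_STANDBY =
      "cluster-manager.fault-domain-aware.standby.enabled";
  static final String STANDBY_REPLICATION_FACTOR = "job.standbytasks.replication.factor";
  // Assumed key for host affinity; verify against the configuration table.
  static final String HOST_AFFINITY = "job.host-affinity.enabled";

  /** True when the settings match the conditions listed in the PR description. */
  static boolean rackAwarenessHonored(Map<String, String> config) {
    // Fallback values below are assumptions for this sketch.
    boolean featureEnabled =
        Boolean.parseBoolean(config.getOrDefault(FAULT_DOMAIN_AWARE_STANDBY, "false"));
    boolean hostAffinity = Boolean.parseBoolean(config.getOrDefault(HOST_AFFINITY, "false"));
    int replicationFactor =
        Integer.parseInt(config.getOrDefault(STANDBY_REPLICATION_FACTOR, "1"));

    // Standby containers need a factor above 1; rack awareness needs it to be at most 2.
    boolean standbyWithinLimit = replicationFactor > 1 && replicationFactor <= 2;
    return featureEnabled && hostAffinity && standbyWithinLimit;
  }

  public static void main(String[] args) {
    Map<String, String> config = new HashMap<>();
    config.put(HOST_AFFINITY, "true");
    config.put(STANDBY_REPLICATION_FACTOR, "2");
    config.put(FAULT_DOMAIN_AWARE_STANDBY, "true");
    System.out.println(rackAwarenessHonored(config)); // prints "true"

    config.put(STANDBY_REPLICATION_FACTOR, "3");
    System.out.println(rackAwarenessHonored(config)); // prints "false"
  }
}
```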

@lakshmi-manasa-g lakshmi-manasa-g left a comment

first pass half way through. will come back to the other half in a bit.

@lakshmi-manasa-g lakshmi-manasa-g left a comment

finished one pass.

@mynameborat mynameborat changed the title from "[WIP] SAMZA-2605: Make Standby Container Requests Rack Aware" to "SAMZA-2605: Make Standby Container Requests Rack Aware" Dec 7, 2020

@mynameborat mynameborat left a comment

Mostly agree with Manasa's comments. Added a few major ones around the FaultDomainManager interface and RackManager.

It is definitely looking much better than the previous version :)

* @param host the host
* @return the {@link FaultDomain}
*/
FaultDomain getFaultDomainOfNode(String host);

+1, use consistent terminology.

@lakshmi-manasa-g lakshmi-manasa-g left a comment

thank you for addressing all the feedback.. just one clarification.

@mynameborat mynameborat left a comment

Took another pass. Any reason why the new parameters go in the middle of the signature instead of at the end?

@mynameborat mynameborat left a comment

A few comments around the configuration. Once you resolve them, can we update the configuration table docs to reflect details about these?

Specifically, which configurations need to be turned on in tandem for the feature to work, and what defaults are provided.

@mynameborat mynameborat left a comment

Thanks for pushing through this. A few follow-ups.
Looks good otherwise.

@lakshmi-manasa-g lakshmi-manasa-g left a comment

a few minor comments.

@mynameborat

A few consistency-related comments. Please resolve the conflicts as well.

@PawasChhokra PawasChhokra force-pushed the RackAwareContainerRequest branch from c5e52cf to b871543 Compare December 23, 2020 03:26

@mynameborat mynameborat left a comment

Mostly LGTM. Can you update where you landed on adding the invariant of a non-empty activeContainerId within the issueStandByAwareAllocation method?

@PawasChhokra PawasChhokra force-pushed the RackAwareContainerRequest branch from c0960b3 to 8ed4575 Compare December 24, 2020 03:59
@mynameborat mynameborat merged commit 5748fa6 into apache:master Dec 24, 2020
lakshmi-manasa-g pushed a commit to lakshmi-manasa-g/samza that referenced this pull request Feb 9, 2021
tranjith pushed a commit to tranjith/samza that referenced this pull request Mar 23, 2021
shekhars-li pushed a commit to shekhars-li/samza that referenced this pull request May 28, 2021
shekhars-li pushed a commit to shekhars-li/samza that referenced this pull request May 28, 2021
…e#1446)"

This reverts commit e9e49af.

RB=2485434
BUG=LISAMZA-20154
G=samza-reviewers
R=calee,rmatharu
A=rmatharu