Skip to content

Coordinator Dynamic Config changes to ease upgrading with new config value#10724

Merged
suneet-s merged 7 commits intoapache:masterfrom
capistrant:fixes-10723
Jan 11, 2021
Merged

Coordinator Dynamic Config changes to ease upgrading with new config value#10724
suneet-s merged 7 commits intoapache:masterfrom
capistrant:fixes-10723

Conversation

@capistrant
Copy link
Copy Markdown
Contributor

@capistrant capistrant commented Dec 31, 2020

Fixes #10723

Description

Fixed the bug introduced in #10723

The mentioned change introduced undesirable behavior during upgrade. The new config value percentOfSegmentsToConsiderPerMove requires a value of 1 - 100. Default is 100. However on upgrade the coordinator will load the dynamic config from metastore using CoordinatorDynamicConfig for deserialization. Since the config will not have existed pre-upgrade, it will be null for the constructor which violates the precondition check on the value. This would cause the code to load the Builder default of CoordinatorDynamicConfig at startup. I believe this is undesirable. Instead we want Druid to load the existing persisted config with the default for the new config while logging a warn to the operator telling them their persisted config is out of date.


This PR has:

  • been self-reviewed.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

Key changed/added classes in this PR
  • CoordinatorDynamicConfig

@capistrant capistrant requested a review from suneet-s December 31, 2020 19:23
@capistrant
Copy link
Copy Markdown
Contributor Author

@suneet-s I have added you as reviewer since you were reviewer on initial PR.

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@capistrant good catch! Let me know if you think it should remain a warn log instead of debug, otherwise LGTM

this.maxSegmentsToMove = maxSegmentsToMove;

if (percentOfSegmentsToConsiderPerMove == null) {
log.warn("percentOfSegmentsToConsiderPerMove was null! This is likely because your metastore does not "
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a warn log message. A user should not be expected to set an optional config. If the config is not set, it seems reasonable to me that the default is set to preserve the old behavior (ie all segments are considered per move).

For users who want to use the new behavior, I suspect they will hear about it in the release notes or the docs, and update the config to get the new functionality.

Suggested change
log.warn("percentOfSegmentsToConsiderPerMove was null! This is likely because your metastore does not "
log.debug("percentOfSegmentsToConsiderPerMove was null! This is likely because your metastore does not "

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya, I guess this is fine as a debug. It will actually automatically get set to the default (or the user specified value) the next time a user submits a new dynamic config since that code path uses builder which will populate missing values. will update now

Copy link
Copy Markdown
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@suneet-s
Copy link
Copy Markdown
Contributor

suneet-s commented Jan 6, 2021

@capistrant It looks like CI is complaining about a missing file for a newly added (I think) integration test 🤔 I'm not sure how this PR could have caused that failure - https://travis-ci.com/github/apache/druid/jobs/468930258 (I tried re-triggering this job a couple times just incase)

Could you maybe try merging master back into this PR to see if that will make Travis happy

@suneet-s suneet-s added the Bug label Jan 6, 2021
private final int maxSegmentsInNodeLoadingQueue;
private final boolean pauseCoordination;

private static final EmittingLogger log = new EmittingLogger(CoordinatorDynamicConfig.class);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I don't think this needs to be EmittingLogger since it isn't doing an alert anywhere

@JsonProperty("mergeSegmentsLimit") int mergeSegmentsLimit,
@JsonProperty("maxSegmentsToMove") int maxSegmentsToMove,
@JsonProperty("percentOfSegmentsToConsiderPerMove") double percentOfSegmentsToConsiderPerMove,
@JsonProperty("percentOfSegmentsToConsiderPerMove") Double percentOfSegmentsToConsiderPerMove,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the alternative way to fix this instead of switching out the primitive would be to treat 0 as not configured and so use the default of 100, though this way seems fine too, I think it just sticks out a bit because the rest of the numbery parameters are primitives.

also nit: should annotate this parameter with @Nullable

i missed the first PR, but any reason this is a double instead of integer similar to other percent based config decommissioningMaxPercentOfMaxSegmentsToMove? I guess there can be millions of segments so the numbers being dealt with here are a bit different (at least hopefully no one is trying to actually move on the scale of that many segments per coordinator run..) so maybe those decimal points really do make a difference, but otoh I can't really imagine operators configuring this much more granular than integer numbers, which I think was the reason we used integer for the other one iirc.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because of my comment in the initial PR - #10284 (comment)

I was thinking about developer error in calculating the percentage

@suneet-s suneet-s added this to the 0.21.0 milestone Jan 8, 2021
@jihoonson
Copy link
Copy Markdown
Contributor

@capistrant the code changes LGTM, but could you add a unit test that verifies JSON deserialization of CoordinatorDynamicConfig with missing percentOfSegmentsToConsiderPerMove?

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1 after CI. Thanks @capistrant!

@suneet-s suneet-s merged commit fe0511b into apache:master Jan 11, 2021
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
…value (apache#10724)

* Coordinator Dynamic Config changes to ease upgrading with new config value

* change a log to debug level following review

* changes based on review feedback

* fix checkstyle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exception in Coordinator if upgrading to Druid packages with percentOfSegmentsToConsiderPerMove implemented

4 participants