Fix the potential race between SplittableInputSource.getNumSplits() and SplittableInputSource.createSplits() in TaskMonitor #8924
Merged
gianm merged 4 commits into apache:master on Nov 23, 2019
…tableInputSource.createSplits() in TaskMonitor
Member
clintropolis
left a comment
this lgtm, but could you try to add a test to simulate a mismatch between estimated number of inputs and actual number of inputs processed where the outcome is still a success?
Contributor
Author
@clintropolis sounds good. Added unit tests.
clintropolis
approved these changes
Nov 23, 2019
Member
clintropolis
left a comment
lgtm 👍 thanks for adding the test
Contributor
Author
Probably the API change should be called out in the release notes.
Description
SplittableInputSource has two methods for parallel indexing. getNumSplits() returns the exact number of splits to process and createSplits() returns a stream of InputSplits. Even though the two are tightly related, caching all splits in memory is not recommended because the cache could grow large when there are many input splits. Without caching, however, the number of splits reported by getNumSplits() can differ from the number produced by createSplits(), because files can be created or deleted between the two calls.

getNumSplits() is currently used for two purposes in the parallel indexing task. The first is to check that all subtasks of a phase have succeeded, and the second is to estimate the progress of each phase.

This PR fixes the bug described above by changing getNumSplits() to getEstimatedNumSplits(). ParallelIndexingPhaseRunner will use the number of SubTaskSpecs that it has iterated and compare it against the number of succeeded subtasks to determine the end of the phase. For the phase progress, it will just use the estimated number of total splits, which could be wrong but does little harm.
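The completion check described above could be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `PhaseProgressSketch`, `PhaseTracker`, and all of their methods are hypothetical names, and subtask specs are represented as plain strings. The point it shows is that phase completion depends only on the specs actually iterated, while the estimate feeds progress reporting alone.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical sketch; these are NOT the actual Druid classes.
public class PhaseProgressSketch {

  static final class PhaseTracker {
    private final int estimatedNumSplits;        // may be stale by the time splits are streamed
    private final Iterator<String> specIterator; // stands in for the stream of sub task specs
    private int specsIterated = 0;
    private int succeeded = 0;

    PhaseTracker(int estimatedNumSplits, Iterator<String> specIterator) {
      this.estimatedNumSplits = estimatedNumSplits;
      this.specIterator = specIterator;
    }

    // Returns the next spec to run, or null once the iterator is exhausted.
    String nextSpec() {
      if (!specIterator.hasNext()) {
        return null;
      }
      specsIterated++;
      return specIterator.next();
    }

    void markSucceeded() {
      succeeded++;
    }

    // Completion compares against what was actually iterated, never the estimate.
    boolean isPhaseDone() {
      return !specIterator.hasNext() && succeeded == specsIterated;
    }

    // Progress uses the estimate only; it may be off, so it is capped at 1.0.
    double estimatedProgress() {
      if (estimatedNumSplits <= 0) {
        return 0.0;
      }
      return Math.min(1.0, (double) succeeded / estimatedNumSplits);
    }
  }

  public static void main(String[] args) {
    // The estimate said 3 splits, but only 2 exist when the splits are created.
    PhaseTracker tracker = new PhaseTracker(3, Arrays.asList("split-a", "split-b").iterator());
    while (tracker.nextSpec() != null) {
      tracker.markSucceeded(); // pretend every subtask succeeds immediately
    }
    System.out.println("phase done: " + tracker.isPhaseDone());
    System.out.println("estimated progress: " + tracker.estimatedProgress());
  }
}
```

In this sketch the phase still finishes successfully even though the estimate was too high; only the reported progress is inaccurate, which matches the trade-off the description accepts.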