
Adjust HadoopIndexTask temp segment renaming to avoid potential race conditions #11075

Merged

jon-wei merged 25 commits into apache:master from zachjsh:hadoop-segment-index-file-rename on Apr 21, 2021

Conversation

@zachjsh (Contributor) commented Apr 7, 2021

Description

Segment index file zips are now renamed from the index task, rather than in the Hadoop reduce job. When the index file rename occurred in the Hadoop reduce job, it was found that at times the final index file would get deleted because of a race condition between job retries.

Manually tested this using the hadoop ingest tutorial: https://druid.apache.org/docs/latest/tutorials/tutorial-batch-hadoop.html
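As a hedged illustration of the pattern this change relies on (names and types here are simplified stand-ins, not Druid's actual code; java.nio.file substitutes for Hadoop's FileSystem API): reducers write the segment zip to an attempt-specific temp path, and only the single index task process performs the final rename, so a retried reduce attempt can no longer delete or clobber the published index file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempSegmentRenameSketch {
    // Sketch of the fix: the rename happens exactly once, in the main task,
    // after all reduce attempts have finished. ATOMIC_MOVE mirrors the intent
    // of FileSystem.rename: either the temp name or the final name exists,
    // never a half-moved file that a retry could observe and delete.
    public static Path renameIndexFile(Path tempZip, Path finalZip) throws Exception {
        Files.createDirectories(finalZip.getParent());
        return Files.move(tempZip, finalZip, StandardCopyOption.ATOMIC_MOVE);
    }
}
```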

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh requested a review from jon-wei April 8, 2021 19:22
{
boolean succeeded = job.run();

if (!config.getSchema().getTuningConfig().isLeaveIntermediate()) {
Contributor
Hm, would removing the cleanup here prevent HadoopDruidDetermineConfigurationJob from doing any necessary cleanup?

@zachjsh (Contributor Author) commented Apr 12, 2021

Good catch! Fixed by adding a subsequent call to do cleanup in HadoopDruidDetermineConfigurationJob. Also found another place, in CliInternalHadoopIndexer, where we were maybe not doing necessary cleanup, and fixed it in a similar way.

@zachjsh zachjsh requested a review from jon-wei April 12, 2021 23:48

public class FileSystemHelper
{
public static FileSystem get(URI uri, Configuration conf) throws IOException
Contributor
This class seems unnecessary

Contributor Author

I needed it for the test that I'm using it in. I wasn't able to mock the raw FileSystem.get routine; I kept running into issues.
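For illustration, the indirection pattern behind a helper like this can be sketched as follows (hypothetical, simplified types; not Druid's actual FileSystemHelper or Hadoop's FileSystem): production code calls a thin wrapper instead of the static factory directly, which gives tests a single seam they can stub or mock.

```java
import java.net.URI;

public class FileSystemHelperSketch {
    // Stand-in for org.apache.hadoop.fs.FileSystem.
    public interface Fs {
        String scheme();
    }

    // Production code calls this wrapper instead of the real static factory
    // (FileSystem.get(uri, conf)); a test can then mock this one method
    // rather than fighting the untestable static call in the library.
    public static Fs get(URI uri) {
        return () -> uri.getScheme(); // illustrative: derive the fs from the URI
    }
}
```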

Contributor

Can you add a comment here about that?

Contributor Author

Added comment

HadoopIngestionSpec indexerSchema)
{
HadoopDruidIndexerConfig config = HadoopDruidIndexerConfig.fromSpec(indexerSchema);
final Configuration configuration = JobHelper.injectSystemProperties(new Configuration(), config);
Contributor

I think you could reuse this Configuration within the try block below

Contributor Author

fixed

}

public static void maybeDeleteIntermediatePath(
boolean indexerGeneratorJobSucceeded,
Contributor

I think this should just be jobSucceeded since it's used by more than one job

Contributor Author

fixed
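To make the cleanup contract discussed in this thread concrete, here is a hedged sketch (the name, parameter order, and use of java.nio.file are illustrative, not Druid's actual JobHelper signature): the working path is removed only when the caller allows it, keyed on whether the job succeeded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class IntermediateCleanupSketch {
    // Delete the intermediate working path only when the job succeeded
    // (or the config asks for cleanup even on failure) and the user has
    // not asked to keep intermediate files for debugging.
    public static void maybeDeleteIntermediatePath(
            boolean jobSucceeded,
            boolean leaveIntermediate,
            boolean cleanupOnFailure,
            Path workingPath
    ) throws IOException {
        if (leaveIntermediate) {
            return; // user explicitly wants the intermediate files kept
        }
        if (jobSucceeded || cleanupOnFailure) {
            // Walk depth-first in reverse so children are deleted before parents.
            try (Stream<Path> walk = Files.walk(workingPath)) {
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.deleteIfExists(p);
                    } catch (IOException ignored) {
                        // best-effort cleanup; a leftover file is not fatal
                    }
                });
            }
        }
    }
}
```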

);
}

public static void writeSegmentDescriptor(
@jon-wei (Contributor) commented Apr 13, 2021

This method also deletes and creates a file; I think the descriptor creation should also be moved into the main task (it could be handled in renameIndexFile, where you have access to a FileSystem). The mappers/reducers would produce a segment file at a temp location, and the main task would handle the rename -> create descriptor -> publish flow.

Contributor Author

We seem to be storing the information about the segments that we publish in the descriptor file, and then read the data written to this file/directory in the main task in order to know the list of segments that were produced. If we don't create this file in the sub task/job, how will we know what segments were created?

@zachjsh (Contributor Author) commented Apr 16, 2021

Changed it so that the segment descriptor for a segment is not deleted, and not overwritten if it already exists, as we discussed.
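The retry-safe behavior described above can be sketched like this (hypothetical helper; java.nio.file's CREATE_NEW stands in for a check-then-create against Hadoop's FileSystem): the first attempt to write a descriptor wins, and later attempts leave the existing file untouched rather than deleting or overwriting it.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DescriptorWriteSketch {
    // Write the segment descriptor only if it does not already exist, so a
    // retried attempt cannot clobber a descriptor another attempt wrote.
    public static boolean writeSegmentDescriptorIfAbsent(Path descriptorPath, byte[] json)
            throws IOException {
        try {
            // CREATE_NEW is an atomic existence-check-and-create.
            Files.write(descriptorPath, json, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;   // this attempt created the descriptor
        } catch (FileAlreadyExistsException e) {
            return false;  // an earlier attempt already wrote it; keep that file
        }
    }
}
```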

@zachjsh zachjsh requested a review from jon-wei April 16, 2021 22:34
@jon-wei (Contributor) commented Apr 20, 2021

Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that.

Can you run ITHadoopIndexTest locally and check if that passes?

@zachjsh (Contributor Author) commented Apr 21, 2021

> Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that.
>
> Can you run ITHadoopIndexTest locally and check if that passes?

Ran IT, everything passed.

@jon-wei jon-wei merged commit a2892d9 into apache:master Apr 21, 2021
@capistrant (Contributor) commented Apr 22, 2021

> Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that.
>
> Can you run ITHadoopIndexTest locally and check if that passes?

I have a suspicion that this PR may have had an impact on jdk11 execution of Indexing Modules Test. @zachjsh were you able to pass tests locally with jdk11 by chance? If so, I could be wrong here

Edit: It seems to be powermock trying to access java internals from what I can tell

jon-wei added a commit to jon-wei/druid that referenced this pull request Apr 22, 2021
jon-wei added a commit that referenced this pull request Apr 22, 2021
@jon-wei (Contributor) commented Apr 22, 2021

@capistrant Thanks, I've reverted that patch

zachjsh added a commit to zachjsh/druid that referenced this pull request May 3, 2021
@zachjsh (Contributor Author) commented May 4, 2021

> > Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that. Can you run ITHadoopIndexTest locally and check if that passes?
>
> I have a suspicion that this PR may have had an impact on jdk11 execution of Indexing Modules Test. @zachjsh were you able to pass tests locally with jdk11 by chance? If so, I could be wrong here
>
> Edit: It seems to be powermock trying to access java internals from what I can tell

Thanks @capistrant! I believe the recent updates I've added fix the issue. I ran both the indexing-hadoop and indexing-service module unit tests with both Java 8 and Java 11 locally, and ran ITHadoopIndexTest locally with both Java 8 and Java 11. The new PR can be found here: #11194

zachjsh added a commit that referenced this pull request May 5, 2021
* Do stuff

* Do more stuff

* * Do more stuff

* * Do more stuff

* * working

* * cleanup

* * more cleanup

* * more cleanup

* * add license header

* * Add unit tests

* * add java docs

* * add more unit tests

* * Cleanup test

* * Move removing of workingPath to index task rather than in hadoop job.

* * Address review comments

* * remove unused import

* * Address review comments

* Do not overwrite segment descriptor for segment if it already exists.

* * add comments to FileSystemHelper class

* * fix local hadoop integration test

* * Fix failing test failures when running with java11

* Revert "Revert "Adjust HadoopIndexTask temp segment renaming to avoid potential race conditions (#11075)" (#11151)"

This reverts commit 49a9c3f.

* * remove JobHelperPowerMockTest

* * remove FileSystemHelper class
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
