
Adjust HadoopIndexTask temp segment renaming to avoid potential race conditions #11075

Merged

jon-wei merged 25 commits into apache:master from zachjsh:hadoop-segment-index-file-rename on Apr 21, 2021

Conversation

@zachjsh (Contributor) commented Apr 7, 2021

Description

Segment index file zips are now renamed from the index task, rather than in the Hadoop reduce job. When the index file rename occurred in the Hadoop reduce job, it was found that at times the final index file would get deleted because of a race condition between job retries.

Manually tested this using the hadoop ingest tutorial: https://druid.apache.org/docs/latest/tutorials/tutorial-batch-hadoop.html
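As a hedged illustration of the pattern this change relies on (names and types here are simplified stand-ins, not Druid's actual code; java.nio.file substitutes for Hadoop's FileSystem API): reducers write the segment zip to an attempt-specific temp path, and only the single index task process performs the final rename, so a retried reduce attempt can no longer delete or clobber the published index file.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TempSegmentRenameSketch {
    // Sketch of the fix: the rename happens exactly once, in the main task,
    // after all reduce attempts have finished. ATOMIC_MOVE mirrors the intent
    // of FileSystem.rename: either the temp name or the final name exists,
    // never a half-moved file that a retry could observe and delete.
    public static Path renameIndexFile(Path tempZip, Path finalZip) throws Exception {
        Files.createDirectories(finalZip.getParent());
        return Files.move(tempZip, finalZip, StandardCopyOption.ATOMIC_MOVE);
    }
}
```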

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@zachjsh zachjsh requested a review from jon-wei April 8, 2021 19:22
{
boolean succeeded = job.run();

if (!config.getSchema().getTuningConfig().isLeaveIntermediate()) {
Contributor
Hm, would removing the cleanup here prevent HadoopDruidDetermineConfigurationJob from doing any necessary cleanup?

@zachjsh (Contributor Author) commented Apr 12, 2021

Good catch! Fixed by adding a subsequent call to do cleanup in HadoopDruidDetermineConfigurationJob. Also found another place, in CliInternalHadoopIndexer, where we were maybe not doing necessary cleanup, and fixed it in a similar way.

@zachjsh zachjsh requested a review from jon-wei April 12, 2021 23:48

public class FileSystemHelper
{
public static FileSystem get(URI uri, Configuration conf) throws IOException
Contributor
This class seems unnecessary

Contributor Author

I needed it for the test that I'm using it in. I wasn't able to mock the raw FileSystem.get routine; I kept running into issues.
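For illustration, the indirection pattern behind a helper like this can be sketched as follows (hypothetical, simplified types; not Druid's actual FileSystemHelper or Hadoop's FileSystem): production code calls a thin wrapper instead of the static factory directly, which gives tests a single seam they can stub or mock.

```java
import java.net.URI;

public class FileSystemHelperSketch {
    // Stand-in for org.apache.hadoop.fs.FileSystem.
    public interface Fs {
        String scheme();
    }

    // Production code calls this wrapper instead of the real static factory
    // (FileSystem.get(uri, conf)); a test can then mock this one method
    // rather than fighting the untestable static call in the library.
    public static Fs get(URI uri) {
        return () -> uri.getScheme(); // illustrative: derive the fs from the URI
    }
}
```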

Contributor

Can you add a comment here about that?

Contributor Author

Added comment

HadoopIngestionSpec indexerSchema)
{
HadoopDruidIndexerConfig config = HadoopDruidIndexerConfig.fromSpec(indexerSchema);
final Configuration configuration = JobHelper.injectSystemProperties(new Configuration(), config);
Contributor

I think you could reuse this Configuration within the try block below

Contributor Author

fixed

}

public static void maybeDeleteIntermediatePath(
boolean indexerGeneratorJobSucceeded,
Contributor

I think this should just be jobSucceeded since it's used by more than one job

Contributor Author

fixed
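To make the cleanup contract discussed in this thread concrete, here is a hedged sketch (the name, parameter order, and use of java.nio.file are illustrative, not Druid's actual JobHelper signature): the working path is removed only when the caller allows it, keyed on whether the job succeeded.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class IntermediateCleanupSketch {
    // Delete the intermediate working path only when the job succeeded
    // (or the config asks for cleanup even on failure) and the user has
    // not asked to keep intermediate files for debugging.
    public static void maybeDeleteIntermediatePath(
            boolean jobSucceeded,
            boolean leaveIntermediate,
            boolean cleanupOnFailure,
            Path workingPath
    ) throws IOException {
        if (leaveIntermediate) {
            return; // user explicitly wants the intermediate files kept
        }
        if (jobSucceeded || cleanupOnFailure) {
            // Walk depth-first in reverse so children are deleted before parents.
            try (Stream<Path> walk = Files.walk(workingPath)) {
                walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.deleteIfExists(p);
                    } catch (IOException ignored) {
                        // best-effort cleanup; a leftover file is not fatal
                    }
                });
            }
        }
    }
}
```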

);
}

public static void writeSegmentDescriptor(
@jon-wei (Contributor) commented Apr 13, 2021

This method also deletes and creates a file; I think the descriptor creation should also be moved into the main task (it could be handled in renameIndexFile, where you have access to a FileSystem). The mappers/reducers would produce a segment file at a temp location, and the main task would handle the rename -> create descriptor -> publish flow.

Contributor Author

We seem to be storing the information about the segments that we publish in the descriptor file, and then read the data written to this file/directory in the main task in order to know the list of segments that were produced. If we don't create this file in the sub task/job, how will we know what segments were created?

@zachjsh (Contributor Author) commented Apr 16, 2021

Changed it so that the segment descriptor for a segment is not deleted, and not overwritten if it already exists, as we discussed.
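The retry-safe behavior described above can be sketched like this (hypothetical helper; java.nio.file's CREATE_NEW stands in for a check-then-create against Hadoop's FileSystem): the first attempt to write a descriptor wins, and later attempts leave the existing file untouched rather than deleting or overwriting it.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DescriptorWriteSketch {
    // Write the segment descriptor only if it does not already exist, so a
    // retried attempt cannot clobber a descriptor another attempt wrote.
    public static boolean writeSegmentDescriptorIfAbsent(Path descriptorPath, byte[] json)
            throws IOException {
        try {
            // CREATE_NEW is an atomic existence-check-and-create.
            Files.write(descriptorPath, json, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;   // this attempt created the descriptor
        } catch (FileAlreadyExistsException e) {
            return false;  // an earlier attempt already wrote it; keep that file
        }
    }
}
```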

@zachjsh zachjsh requested a review from jon-wei April 16, 2021 22:34
@jon-wei (Contributor) commented Apr 20, 2021

Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that.

Can you run ITHadoopIndexTest locally and check if that passes?

@zachjsh (Contributor Author) commented Apr 21, 2021

> Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that.
>
> Can you run ITHadoopIndexTest locally and check if that passes?

Ran IT, everything passed.

@jon-wei jon-wei merged commit a2892d9 into apache:master Apr 21, 2021
@capistrant (Contributor) commented Apr 22, 2021

> Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that.
>
> Can you run ITHadoopIndexTest locally and check if that passes?

I have a suspicion that this PR may have had an impact on jdk11 execution of Indexing Modules Test. @zachjsh were you able to pass tests locally with jdk11 by chance? If so, I could be wrong here

Edit: It seems to be powermock trying to access java internals from what I can tell

jon-wei added a commit to jon-wei/druid that referenced this pull request Apr 22, 2021
jon-wei added a commit that referenced this pull request Apr 22, 2021
@jon-wei (Contributor) commented Apr 22, 2021

@capistrant Thanks, I've reverted that patch

zachjsh added a commit to zachjsh/druid that referenced this pull request May 3, 2021
@zachjsh (Contributor Author) commented May 4, 2021

> > Hm, we don't have any existing unit tests for HadoopIndexTask, so I think it'd be fine to ignore the coverage failure from that. Can you run ITHadoopIndexTest locally and check if that passes?
>
> I have a suspicion that this PR may have had an impact on jdk11 execution of Indexing Modules Test. @zachjsh were you able to pass tests locally with jdk11 by chance? If so, I could be wrong here
>
> Edit: It seems to be powermock trying to access java internals from what I can tell

Thanks @capistrant! I believe the recent updates I've added fix the issue. I ran both the indexing-hadoop and indexing-service module unit tests with both Java 8 and Java 11 locally, and ran ITHadoopIndexTest locally with both Java 8 and Java 11. The new PR can be found here: #11194

zachjsh added a commit that referenced this pull request May 5, 2021
* Do stuff

* Do more stuff

* * Do more stuff

* * Do more stuff

* * working

* * cleanup

* * more cleanup

* * more cleanup

* * add license header

* * Add unit tests

* * add java docs

* * add more unit tests

* * Cleanup test

* * Move removing of workingPath to index task rather than in hadoop job.

* * Address review comments

* * remove unused import

* * Address review comments

* Do not overwrite segment descriptor for segment if it already exists.

* * add comments to FileSystemHelper class

* * fix local hadoop integration test

* * Fix failing test failures when running with java11

* Revert "Revert "Adjust HadoopIndexTask temp segment renaming to avoid potential race conditions (#11075)" (#11151)"

This reverts commit 49a9c3f.

* * remove JobHelperPowerMockTest

* * remove FileSystemHelper class
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021
