
Add support for parallel native indexing with shuffle for perfect rollup#8257

Merged
jihoonson merged 28 commits into apache:master from jihoonson:superbatch-shuffle
Aug 16, 2019

Conversation

@jihoonson
Contributor

Part of #8061.

This PR is based on #8236.

Description

This PR adds support for parallel native indexing with shuffle for perfect rollup.

New configurations in tuningConfig for the parallel index task (an example sketch follows the list):

  • forceGuaranteedRollup: if set to true, parallel indexing is executed in two phases with shuffle.
  • maxNumSegmentsToMerge: maximum number of segments that a single task can merge at the same time in the second phase. Used only when forceGuaranteedRollup is set.
  • totalNumMergeTasks: total number of tasks to merge segments in the second phase when forceGuaranteedRollup is set.
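For illustration, a minimal tuningConfig sketch with the new options might look like the following JSON (values are just the defaults discussed later; the hash-partitioning settings that guaranteed rollup also requires are omitted, and the exact field set is an assumption rather than copied from the docs):

{
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "maxNumSegmentsToMerge": 100,
  "totalNumMergeTasks": 10
}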

Refactoring

  • FiniteFirehoseProcessor is added to share the duplicated code for processing a firehose.
  • ParallelIndexPhaseRunner is added to share the duplicated code across implementations of ParallelIndexTaskRunner.
  • Added the SubTaskReport interface to support different types of sub task reports.

Multi-phase indexing

If forceGuaranteedRollup is set, ParallelIndexSupervisorTask runs PartialSegmentGenerateParallelIndexTaskRunner and PartialSegmentMergeParallelIndexTaskRunner for the first and second phases, respectively.
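A rough control-flow sketch of the two phases (the runner-creation calls and status handling below are simplified stand-ins, not the actual ParallelIndexSupervisorTask API):

// Illustrative only: phase construction and return types are simplified.
TaskState runGuaranteedRollup(TaskToolbox toolbox) throws Exception
{
  // Phase 1: each sub task partitions its input by segmentGranularity (primary
  // partition key) and the secondary partition key, writes partial segments
  // locally, and reports their locations to the supervisor task.
  TaskState generateState = createPartialSegmentGenerateRunner(toolbox).run();
  if (generateState != TaskState.SUCCESS) {
    return generateState;
  }

  // Phase 2: merge sub tasks fetch the partial segments assigned to them over
  // HTTP, merge each partition into final segments, and publish them.
  return createPartialSegmentMergeRunner(toolbox).run();
}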


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.
    • tested by ingesting the lineitem data of TPC-H@100GB

Comment thread on docs/content/ingestion/native_tasks.md (outdated)
each sub task creates segments individually and reports them to the supervisor task.

If `forceGuaranteedRollup` = true, it's executed in two phases with data shuffle which is similar to [MapReduce](https://en.wikipedia.org/wiki/MapReduce).
In the first phase, each sub task partitions input data based on `segmentGranularity` (primary partition key) in `granaulritySpec`
Contributor

granaulritySpec -> granularitySpec

Contributor Author

Thanks, fixed.

static boolean isGuaranteedRollup(IndexIOConfig ioConfig, IndexTuningConfig tuningConfig)
{
Preconditions.checkState(
!tuningConfig.isForceGuaranteedRollup() || !ioConfig.isAppendToExisting(),
Contributor

hm, I wonder if this restriction is truly necessary (would it make sense if someone wants to append a perfectly-rolled up set of segments to a non-perfectly rolled up set?)

Contributor

Or, please add a comment explaining why, technically, we can't append new perfectly rolled-up segments to existing ones. Is this not possible, or just not enabled at this point?

Contributor Author

Hmm, good point. I think, technically, we can append a perfectly rolled-up set of segments to a non-perfectly rolled-up one, although some code would need to be fixed to support it (e.g., HashBasedNumberedShardSpec always assumes that the start partitionId in the perfectly rolled-up set is 0). I'm not sure how useful it is, but maybe it could help in some use cases. Added a javadoc.

Member

HashBasedNumberedShardSpec always assumes that the start partitionId in the perfectly-rolled up set is 0

Should you put some of the things that it would take to support it in the javadoc just in case it does become useful to save someone in the future some trouble?

Contributor Author

Added a comment.

GranularitySpec granularitySpec,
IndexIOConfig ioConfig,
IndexTuningConfig tuningConfig,
PartitionsSpec nonNullPartitionsSpec
Contributor

Suggest a @Nonnull annotation instead

Contributor Author

Added the annotation, but kept the name as it is since it doesn't seem harmful.


if (isGuaranteedRollup(ioConfig, tuningConfig)) {
// Overwrite mode, guaranteed rollup: shardSpecs must be known in advance.
assert nonNullPartitionsSpec instanceof HashedPartitionsSpec;
Contributor

Is this because we would otherwise need to run a job like DeterminePartitionsJob to figure out partitions, which would add another phase and is avoided right now?

Contributor Author

Hmm, the point about DeterminePartitionsJob is correct, but this assertion is here because the index task and the parallel index task currently only support the hashed partitions spec. The range partitions spec will be supported as well, and this assertion will be removed in the future.

Or, if you're asking about the comment, the index task already has a similar mode to determine partitions automatically, and this method is not called in that mode.

Contributor

I wasn't asking about the comment, but trying to understand what technically prevents us from using the dimension partitions spec. It seems my guess was correct; it would be nice to add that in the comment.

Contributor Author

Added a comment.


try {
if (eachSpec.isReady(toolbox.getTaskActionClient())) {
if (currentSubTaskHolder.setTask(eachSpec) && eachSpec.isReady(toolbox.getTaskActionClient())) {
Contributor

Should currentSubTaskHolder.setTask(eachSpec) be handled separately, where the prevSpec == SPECIAL_VALUE_STOPPED check happened? With this change, the log message about the task being intentionally stopped is gone, and this loop would continue checking all task specs.

Contributor Author

Thanks, fixed.

);
}
}
} else if (taskMonitor.getNumRunningTasks() < maxNumTasks) {
Contributor

What would trigger this if case? It looks like the subTaskSpecIterator would be exhausted before entering the while (isRunning()) loop, and all the sub tasks should've been submitted already.

Contributor Author

This can happen if the number of sub task specs to execute is larger than maxNumTasks, which is the max number of sub tasks that can be executed concurrently. Renamed it to maxNumConcurrentSubTasks and added a javadoc.
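A self-contained illustration of that scheduling rule (plain Java with standard-library types, not the actual runner code): at most maxNumConcurrentSubTasks specs run at once, and the remaining specs wait for a free slot.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedSubmissionExample
{
  public static void main(String[] args) throws InterruptedException
  {
    final int maxNumConcurrentSubTasks = 3;
    final int totalSubTaskSpecs = 10;
    final ExecutorService exec = Executors.newFixedThreadPool(maxNumConcurrentSubTasks);
    final Semaphore slots = new Semaphore(maxNumConcurrentSubTasks);

    for (int i = 0; i < totalSubTaskSpecs; i++) {
      slots.acquire();                    // wait until a task slot is free
      final int spec = i;
      exec.submit(() -> {
        try {
          System.out.println("running sub task spec " + spec);
        }
        finally {
          slots.release();                // free the slot for the next spec
        }
      });
    }

    exec.shutdown();
    exec.awaitTermination(1, TimeUnit.MINUTES);
  }
}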

public void onFailure(Throwable t)
{
// this callback is called only when there were some problems in TaskMonitor.
LOG.error(t, "Error while running a task for subTaskSpec[%s]", spec);
Contributor

Suggest revising the error log message to mention that it indicates issues with TaskMonitor specifically

Contributor Author

Would you elaborate on what you think should be mentioned? The Future returned from taskMonitor.submit() indicates the result of processing a sub task spec after a certain number of task retries on failures.

Contributor
@jon-wei, Aug 14, 2019

I think the comments here could discuss how onFailure is triggered after task retries are exhausted across the set of subtasks, and onSuccess is called for individual subtask success

Contributor Author

Added a javadoc to TaskMonitor.submit().
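For reference, the semantics described above could be summarized roughly like this (a hedged sketch; the actual javadoc wording and the generic types on TaskMonitor are not reproduced here, and the ListenableFuture return type is an assumption):

// The Future returned by taskMonitor.submit(spec) tracks the whole sub task
// spec, including retries of failed task attempts:
// - onSuccess fires once the spec finishes, whether its final attempt
//   succeeded or failed (the complete event carries the final state).
// - onFailure fires only when TaskMonitor itself hits a problem, e.g. it
//   cannot submit or track the task.
ListenableFuture<SubTaskCompleteEvent<SubTaskType>> future = taskMonitor.submit(spec);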

@Override
public void onSuccess(SubTaskCompleteEvent<SubTaskType> completeEvent)
{
// this callback is called if a task completed wheter it succeeded or not.
Contributor

wheter -> whether

Contributor Author

Thanks, fixed.

// We have more subTasks to run
submitNewTask(taskMonitor, subTaskSpecIterator.next());
} else {
// We have more subTasks to run, but don't have enough available task slots
Contributor
@jon-wei, Aug 14, 2019

It doesn't have to be done now, but I think it could be nice to have a warning log for cases where tasks couldn't be scheduled for X loop iterations or some other threshold, indicating that there may be too much contention for free task slots in the cluster.

Contributor Author

👍
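A sketch of what that follow-up might look like in the submission loop (not implemented in this PR; the counter, threshold, and log message are illustrative, and the real loop sleeps between iterations):

int iterationsWithoutFreeSlot = 0;
final int warnThreshold = 100;  // illustrative threshold

while (subTaskSpecIterator.hasNext()) {
  if (taskMonitor.getNumRunningTasks() < maxNumConcurrentSubTasks) {
    submitNewTask(taskMonitor, subTaskSpecIterator.next());
    iterationsWithoutFreeSlot = 0;
  } else if (++iterationsWithoutFreeSlot >= warnThreshold) {
    LOG.warn(
        "Could not schedule new sub tasks for [%d] iterations; the cluster may not have enough free task slots",
        iterationsWithoutFreeSlot
    );
    iterationsWithoutFreeSlot = 0;
  }
}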

*
* @return the number of bytes copied
*/
public static <T> long fetch(
Contributor

nit: this could go into the FileUtils class because, as a dev, if I needed something like this, I would look in FileUtils first to see if it already exists.

Contributor Author

Moved 👍


@GET
@Path("/phase")
@Produces(MediaType.APPLICATION_ATOM_XML)
Contributor

Why is MediaType.APPLICATION_ATOM_XML used here?

Contributor Author

Oops, should be MediaType.APPLICATION_JSON. Fixed.
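The corrected endpoint declaration then looks like this (the method name, body, and currentPhaseName field are hypothetical stand-ins for the supervisor task's chat handler code):

@GET
@Path("/phase")
@Produces(MediaType.APPLICATION_JSON)
public Response getCurrentPhaseName()
{
  // currentPhaseName is a hypothetical field holding the running phase's name.
  return Response.ok(Collections.singletonMap("phase", currentPhaseName)).build();
}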

private static final Duration DEFAULT_CHAT_HANDLER_TIMEOUT = new Period("PT10S").toStandardDuration();
private static final int DEFAULT_CHAT_HANDLER_NUM_RETRIES = 5;
private static final int DEFAULT_MAX_NUM_SEGMENTS_TO_MERGE = 100;
private static final int DEFAULT_TOTAL_NUM_MERGE_TASKS = 10;
Contributor

Could consider making the default number of merge tasks the same as the effective value for maxNumSubTasks instead

Contributor Author

maxNumSubTasks is actually the max number of sub tasks that can run concurrently. Raised #8318.

{
return URI.create(
StringUtils.format(
"http://%s:%d/druid/worker/v1/shuffle/task/%s/%s/partition?startTime=%s&endTime=%s&partitionId=%d",
Contributor

What happens if TLS is enabled and plaintext port is disabled on the data server?

Contributor Author

Good point! Fixed to use https if TLS is enabled.
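A self-contained sketch of the fix (plain Java, not the actual code; which IDs fill the two path segments, and how host and ports are obtained, are assumptions here): choose the scheme and port based on whether TLS is enabled on the data server.

import java.net.URI;

public class PartitionUriExample
{
  static URI partitionUri(
      boolean tlsEnabled,
      String host,
      int plaintextPort,
      int tlsPort,
      String supervisorTaskId,
      String subTaskId,
      String startTime,
      String endTime,
      int partitionId
  )
  {
    final String scheme = tlsEnabled ? "https" : "http";
    final int port = tlsEnabled ? tlsPort : plaintextPort;
    return URI.create(
        String.format(
            "%s://%s:%d/druid/worker/v1/shuffle/task/%s/%s/partition?startTime=%s&endTime=%s&partitionId=%d",
            scheme,
            host,
            port,
            supervisorTaskId,
            subTaskId,
            startTime,
            endTime,
            partitionId
        )
    );
  }
}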


final List<PartialSegmentMergeIOConfig> assignedPartitionLocations = new ArrayList<>(numMergeTasks);
for (int i = 0; i < numMergeTasks - 1; i++) {
final List<PartitionLocation> assingedToSameTask = partitions
Contributor

assingedToSameTask -> assignedToSameTask, as well as below

Contributor Author

Thanks, fixed.
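An illustrative, self-contained version of that assignment (plain Java; the real code works with PartitionLocation objects and builds a PartialSegmentMergeIOConfig per group):

import java.util.ArrayList;
import java.util.List;

public class MergeTaskAssignmentExample
{
  // Split the partition locations into at most numMergeTasks groups of roughly
  // equal size; each group becomes the input of one merge sub task.
  static <T> List<List<T>> assign(List<T> partitions, int numMergeTasks)
  {
    final List<List<T>> assigned = new ArrayList<>(numMergeTasks);
    final int groupSize = (int) Math.ceil((double) partitions.size() / numMergeTasks);
    for (int i = 0; i < partitions.size(); i += groupSize) {
      assigned.add(
          new ArrayList<>(partitions.subList(i, Math.min(i + groupSize, partitions.size())))
      );
    }
    return assigned;
  }
}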

}
} else {
// If it's still running, update last access time.
supervisorTaskCheckTimes.put(supervisorTaskId, DateTimes.nowUtc());
Contributor

Should this and possibly other access time updates use DateTimes.nowUtc().plus(intermediaryPartitionTimeout) instead?

Contributor Author

👍 fixed.

@himanshug
Contributor

I read through this PR and it LGTM overall. Thumbs up for testing it with a large dataset.

Currently, the user needs to set totalNumMergeTasks, which by default is a static value of 10. As a follow-up, it would be great to compute the default dynamically based on the size of data, number of partitions, number of segment intervals, etc., so that the default behavior just works for most cases.

Member
@clintropolis left a comment

overall lgtm, 👍

/**
* Max number of segments to merge at the same time.
* Used only by {@link PartialSegmentMergeTask}.
* This configuration was temporally added to avoid using too much memory while merging segments,
Member

nit: I think maybe 'temporarily' is more appropriate word here

Contributor Author

Thanks, fixed.


final SegmentsAndMetadata pushed = driver.pushAllAndClear(pushTimeout);
log.info("Pushed segments[%s]", pushed.getSegments());
final FiniteFirehoseProcessor firehoseProcessor = new FiniteFirehoseProcessor(
Member

this is nice 👍


if (mergedFiles.size() == 1) {
return Pair.of(mergedFiles.get(0), Preconditions.checkNotNull(dimensionNames, "dimensionNames"));
} else {
return mergeSegmentsInSamePartition(
Member

I guess the sizes involved here make the recursion not an issue 😅?

Contributor Author

Yes, I think the level of recursion shouldn't be very high.
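For intuition, here is a self-contained toy version of that recursion (strings stand in for segment files; the real code merges actual segments and tracks dimension names): each pass merges groups of at most maxNumSegmentsToMerge, so the depth grows only logarithmically with the number of segments.

import java.util.ArrayList;
import java.util.List;

public class GroupedMergeExample
{
  // Stand-in for the real segment merge; it just joins the labels of a group.
  static String mergeGroup(List<String> group)
  {
    return String.join("+", group);
  }

  // Assumes a non-empty list and maxNumSegmentsToMerge >= 2.
  static String mergeAll(List<String> segments, int maxNumSegmentsToMerge)
  {
    if (segments.size() == 1) {
      return segments.get(0);
    }
    final List<String> merged = new ArrayList<>();
    for (int i = 0; i < segments.size(); i += maxNumSegmentsToMerge) {
      merged.add(mergeGroup(segments.subList(i, Math.min(i + maxNumSegmentsToMerge, segments.size()))));
    }
    // Recurse on the shrinking list of intermediate results.
    return mergeAll(merged, maxNumSegmentsToMerge);
  }
}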

@jihoonson
Contributor Author

Thank you for the review.

Currently, the user needs to set totalNumMergeTasks, which by default is a static value of 10. As a follow-up, it would be great to compute the default dynamically based on the size of data, number of partitions, number of segment intervals, etc., so that the default behavior just works for most cases.

Yes, the number of merge tasks can be computed automatically in the future. PartitionStat was added to collect some statistics that could be useful for such a computation. I think it would be pretty cool to support it, but I'm not sure when I can work on it yet. I'll try to do it before 0.17.

Member
@clintropolis left a comment

👍 after CI

Contributor
@jon-wei left a comment

+1 after CI

@jihoonson
Contributor Author

@jon-wei @himanshug @clintropolis thank you for the review!

