Parallel indexing single dim partitions #8925
Conversation
Implements single dimension range partitioning for native parallel batch indexing as described in apache#8769. This initial version requires the druid-datasketches extension to be loaded. The algorithm has 5 phases that are orchestrated by the supervisor in `ParallelIndexSupervisorTask#runRangePartitionMultiPhaseParallel()`. These phases and the main classes involved are described below:

1. In parallel, determine the distribution of dimension values for each input source split. `PartialDimensionDistributionTask` uses `StringSketch` to generate the approximate distribution of dimension values for each input source split. If the rows are ungrouped, `PartialDimensionDistributionTask.UngroupedRowDimensionValueFilter` uses a Bloom filter to skip rows that would be grouped. The final distribution is sent back to the supervisor via `DimensionDistributionReport`.
2. The range partitions are determined. In `ParallelIndexSupervisorTask#determineAllRangePartitions()`, the supervisor uses `StringSketchMerger` to merge the individual `StringSketch`es created in the preceding phase. The merged sketch is then used to create the range partitions.
3. In parallel, generate partial range-partitioned segments. `PartialRangeSegmentGenerateTask` uses the range partitions determined in the preceding phase and `RangePartitionCachingLocalSegmentAllocator` to generate `SingleDimensionShardSpec`s. The partition information is sent back to the supervisor via `GeneratedGenericPartitionsReport`.
4. The partial range segments are grouped. In `ParallelIndexSupervisorTask#groupGenericPartitionLocationsPerPartition()`, the supervisor creates the `PartialGenericSegmentMergeIOConfig`s necessary for the next phase.
5. In parallel, merge partial range-partitioned segments. `PartialGenericSegmentMergeTask` uses `GenericPartitionLocation` to retrieve the partial range-partitioned segments generated earlier and then merges and publishes them.
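As a rough illustration of phases 1 and 2, the hedged Java sketch below shows how a merged quantiles sketch can be cut into range-partition boundaries. It assumes the DataSketches quantiles API (`org.apache.datasketches.quantiles`, formerly `com.yahoo.sketches.quantiles`); the class name, the `targetRowsPerSegment` handling, and the demo data are illustrative, not the PR's exact code.

```java
import java.util.Comparator;
import org.apache.datasketches.quantiles.ItemsSketch;

// Hedged sketch, not the PR's code: subtasks feed dimension values into quantile
// sketches; the supervisor merges them and cuts the merged sketch into ranges of
// roughly equal row counts.
class RangePartitionBoundariesExample
{
  static String[] computeBoundaries(ItemsSketch<String> mergedSketch, int targetRowsPerSegment)
  {
    int numPartitions = (int) Math.max(1, mergedSketch.getN() / targetRowsPerSegment);
    // getQuantiles(n) returns n evenly spaced quantiles, including the min and max
    // values seen, so numPartitions ranges need numPartitions + 1 boundaries.
    return mergedSketch.getQuantiles(numPartitions + 1);
  }

  public static void main(String[] args)
  {
    ItemsSketch<String> sketch = ItemsSketch.getInstance(1 << 12, Comparator.naturalOrder());
    for (int i = 0; i < 100_000; i++) {
      sketch.update(String.format("host%05d", i % 1000));  // synthetic dimension values
    }
    System.out.println(String.join(", ", computeBoundaries(sketch, 25_000)));
  }
}
```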
| For perfect rollup, you should use either `hashed` (partitioning based on the hash of dimensions in each row) or
| `single_dim` (based on ranges of a single dimension). For best-effort rollup, you should use `dynamic`.

| Hashed partitioning is recommended in most cases, as it will improve indexing performance and create more uniformly
I'm not sure how hashed partitioning can improve indexing performance or create more uniformly sized data segments relative to dynamic partitioning. With dynamic partitioning, the parallel indexing task will run in single-phase mode, whereas hash-based partitioning requires two-phase mode. Also, the uniformity in segment size with hashed partitioning will depend on the partition key distribution, whereas dynamic partitioning guarantees a max size for segments. Am I missing something?
I've reworded it to only recommend hash partitioning over single dim partitioning for perfect rollup.
| private TaskStatus runRangePartitionMultiPhaseParallel(TaskToolbox toolbox) throws Exception
| {
|   assertDataSketchesAvailable();
This method will be called after the task is assigned to a middleManager or an indexer, which could waste resources and time if the datasketches extension is not loaded. I think it's uncommon to load datasketches in middleManagers but not in the overlord. Can we check this in `isReady()`?
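A minimal sketch of that suggestion, assuming the check can be hoisted into the overlord-side readiness check; `useRangePartitions()` is a hypothetical guard and the existing readiness logic is elided:

```java
@Override
public boolean isReady(TaskActionClient taskActionClient) throws Exception
{
  if (useRangePartitions()) {       // hypothetical guard: only single_dim needs sketches
    assertDataSketchesAvailable();  // now fails fast on the overlord, before task assignment
  }
  // ... existing readiness / lock acquisition logic unchanged ...
  return true;
}
```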
| return TaskStatus.failure(getId());
| }

| Map<Interval, String[]> intervalToPartitions =
Suggest creating a concrete class rather than using a Map. It will be more intuitive and easier to understand.
I think it makes more sense to add a class wrapping `String[]` rather than the map.
I've created a class `Partitions` and used it to replace usages of `String[]`.
| <groupId>org.apache.logging.log4j</groupId>
| <artifactId>log4j-api</artifactId>
| </dependency>
| <dependency>
Would you please add a comment noting where datasketches are used?
| metricsNames
| ),
| inputFormat,
| null
Shouldn't be null. `toolbox.getIndexingTmpDir()`?
| String minDimensionValue = intervalToMinDimensionValue.get(interval);
| if (minDimensionValue == null || dimensionValue.compareTo(minDimensionValue) < 0) {
|   intervalToMinDimensionValue.put(interval, dimensionValue);
| }
super nit: you can use `compute()`. It's better in the sense that it computes the hash code only once, but I don't believe that matters here:

```java
intervalToMinDimensionValue.compute(
    interval,
    (intervalKey, currentMinValue) -> {
      if (currentMinValue == null || dimensionValue.compareTo(currentMinValue) < 0) {
        return dimensionValue;
      } else {
        return currentMinValue;
      }
    }
);
```
Changed this method and `updateMaxDimensionValue` to use `Map.compute()`.
| {
|   Map<Interval, StringDistribution> intervalToDistribution = new HashMap<>();
|   DimensionValueFilter dimValueFilter =
|       isAssumeGrouped
Looks like it should also check whether rollup is enabled.
Added a check for `granularitySpec.isRollup()`.
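The guard might end up looking roughly like the hedged sketch below; `PassthroughRowDimensionValueFilter` is a hypothetical name for the no-op branch and the constructor arguments are elided:

```java
// Hedged sketch, not the PR's exact code: the Bloom-filter dedup only makes sense
// when rollup is enabled and the input is not already grouped.
DimensionValueFilter dimValueFilter =
    granularitySpec.isRollup() && !isAssumeGrouped
    ? new UngroupedRowDimensionValueFilter(/* ... */)
    : new PassthroughRowDimensionValueFilter();  // hypothetical no-op filter
```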
| }

| @VisibleForTesting
| static class UngroupedRowDimensionValueFilter implements DimensionValueFilter
Would you please add a javadoc describing which rows this will filter out?
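A hedged example of the kind of javadoc being requested, pieced together from the PR description (the wording that landed in the PR may differ):

```java
/**
 * Filter that skips rows whose dimension value has already been seen for the same
 * rolled-up row, i.e., rows that would be grouped together during rollup. A Bloom
 * filter tracks previously seen values, so a small fraction of first occurrences may
 * be skipped as false positives; this is acceptable because the downstream
 * distribution sketch is approximate anyway.
 */
@VisibleForTesting
static class UngroupedRowDimensionValueFilter implements DimensionValueFilter
```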
| public interface StringDistribution
| {
|   /**
|    * Record occurence of {@link String}
typo: occurence -> occurrence
| @@ -63,19 +66,21 @@ interface IntervalToSegmentIdsCreator
| CachingLocalSegmentAllocator(
Hmm, maybe this should be renamed to something else, since it's a bit strange that callers don't extend this class directly, though I think I see why now.
Renamed to `CachingLocalSegmentAllocatorHelper`.
| //noinspection ResultOfObjectAllocationIgnored
| new StringSketch();
| }
| catch (Throwable t) {
I think this should catch the particular type of error. I guess it's `ClassNotFoundException` to be caught?
Changed to catch `NoClassDefFoundError`.
| return Collections.emptyList();
| }

| String[] uniquePartitions = Arrays.stream(partitions).distinct().toArray(String[]::new);
Can partitions have duplicate values? They seem to come from `StringSketch.getEventPartitionsByCount()`, which shouldn't return duplicates. If this shouldn't have duplicates, I would suggest adding a sanity check instead of deduplicating them.
Currently, `StringSketch.getEventPartitionsByCount()` can return duplicate values since it's a wrapper for `ItemsSketch.getQuantiles()` (i.e., if the distribution has many duplicates, adjacent quantiles may have the same value).
I'll also fix the typo in "getEventPartitionsByCount".
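A hedged demo of that behavior, calling the DataSketches quantiles API directly (package name and `getQuantiles(int)` signature as of the 1.x line):

```java
import java.util.Arrays;
import java.util.Comparator;
import org.apache.datasketches.quantiles.ItemsSketch;

// For a heavily skewed distribution, adjacent quantiles collapse to the same value,
// which is why the caller deduplicates before building shard specs.
public class DuplicateQuantilesDemo
{
  public static void main(String[] args)
  {
    ItemsSketch<String> sketch = ItemsSketch.getInstance(1 << 12, Comparator.naturalOrder());
    for (int i = 0; i < 1_000_000; i++) {
      sketch.update("common");
    }
    sketch.update("rare");
    // Likely prints [common, common, common, common, rare].
    System.out.println(Arrays.toString(sketch.getQuantiles(5)));
  }
}
```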
| if (isLastPartitionOnlyMaxValue(partitions)) {
|   // The last partition only contains the max value. A shard that just contains the max value is likely to be
|   // small, so combine it with the second to last one.
Hmm, I'm not sure this assumption makes sense. The indexing might be a bit better if the assumption holds, but if not, the indexing speed could get significantly worse due to the skewed data distribution. I think we shouldn't depend on this kind of assumption. Instead, we can utilize `PartitionStat` once we collect it properly in the future.
Or probably we should also collect the number of elements in each partition. Does the sketch provide such functionality?
The sketch does not have an API that does that. I've removed the logic to combine the last two partitions.
| uniquePartitions[i + 1],
| i
| ))
| .collect(Collectors.toCollection(ArrayList::new));
Can be simplified into `Collectors.toList()`.
The javadoc for `Collectors.toList()` states:
> There are no guarantees on the type, mutability, serializability, or thread-safety of the List returned
Since the returned List needs to be mutated, I specified the list implementation used by the Collector.
| return intervalToSegmentIds;
| }

| private List<SegmentIdWithShardSpec> translatePartitions(
Please add a Javadoc since it's not easy to guess what this method does from its name.
| *
| * @see PartialHashSegmentMergeParallelIndexTaskRunner
| */
| class PartialRangeSegmentGenerateParallelIndexTaskRunner
Similarly, suggest `PartialRangePartitionedSegmentGenerateRunner`.
I'll leave the current name for now.
Do you mean you're planning to do it later? Would you please open an issue for it then?
| {
|   static final String NAME = "sketch";
|   static final int SKETCH_K = 1 << 12; // smallest value with normalized rank error < 0.1%; retain up to ~86k elements
|   static final Comparator<String> SKETCH_COMPARATOR = Comparator.naturalOrder();
nit: this doesn't necessarily have to be done in this PR, but it would be nice if it supported different orderings for the string type.
I like the idea, but think it'll be best to do it in a subsequent PR.
| public abstract class IngestionTestBase
| {
|   static {
|     NullHandling.initializeForTests();
Please extend `InitializedNullHandlingTest` instead of this.
| @Override
| ShardSpec createShardSpec(TaskToolbox toolbox, Interval interval, int partitionNum)
| {
|   return createIntervalAndIntegerToShardSpec.get(interval, partitionNum);
This should fail if there is no shardSpec for the given interval and partitionNum.
Added a `Preconditions` check.
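For instance (a hedged sketch; the exact message in the PR may differ), using Guava's templated overload:

```java
ShardSpec shardSpec = createIntervalAndIntegerToShardSpec.get(interval, partitionNum);
Preconditions.checkNotNull(
    shardSpec,
    "no shardSpec for interval[%s] and partitionNum[%s]",  // Guava templates use %s
    interval,
    partitionNum
);
return shardSpec;
```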
| }

| @Override
| ShardSpec createShardSpec(TaskToolbox toolbox, Interval interval, int partitionNum)
Suggest using `partitionId` instead of `partitionNum` because I think it's clearer. Same for other places.
| List<String> metricsNames = Arrays.stream(dataSchema.getAggregators())
|     .map(AggregatorFactory::getName)
|     .collect(Collectors.toList());
| InputFormat inputFormat = ParallelIndexSupervisorTask.getInputFormat(ingestionSchema);
Should be `inputSource.needsFormat() ? ParallelIndexSupervisorTask.getInputFormat(ingestionSchema) : null`.
jihoonson left a comment
A task with range partitioning can fail if the set of input files changes between the distribution investigation phase and the partial segment generation phase, because `CachingLocalSegmentAllocatorHelper.allocate()` will return null for unknown sequence names. Even though it would also be good to have an option to continue indexing instead of failing, I think it's OK for now to always fail. Would you please call this out in the documentation as well?
| For perfect rollup, you should use either `hashed` (partitioning based on the hash of dimensions in each row) or
| `single_dim` (based on ranges of a single dimension). For best-effort rollup, you should use `dynamic`.

| For perfect rollup, `hashed` partitioning is recommended in most cases, as it will improve indexing
I think it's worth clearly mentioning the pros/cons of using each partitionsSpec instead of promoting hashed partitioning:

- With `dynamic` partitioning, you can expect the fastest ingestion speed compared to the other partitionsSpecs. It also always guarantees a well-balanced distribution in segment size.
- With `hashed`, your wording is correct.
- With `single_dim`, the partitioning can be skewed depending on the partition key, but the broker can use the partition information to prune the segments to query earlier. If the query has a filter on the partition key column, the broker can filter out segments which have only values not satisfying the filter.
Added a section describing the pros/cons.
| > Single-dimension range partitioning currently requires the
| > [druid-datasketches](../development/extensions-core/datasketches-extension.md)
| > extension to be added to the classpath.
How about "extension to be [loaded](https://druid.apache.org/docs/0.16.0-incubating/development/extensions.html#loading-extensions)"?
Added a link to loading the extension from the classpath: https://druid.apache.org/docs/latest/development/extensions.html#loading-extensions-from-the-classpath
Also added a warning here about possible errors if the input changes during the two passes over the input.
| int numUniquePartition = uniquePartitions.length;

| // First partition starts with null (see StringPartitionChunk.isStart())
| uniquePartitions[0] = null;
It looks like the last value also needs to be null. Also, please add a comment explaining why it's OK to update those values.
And probably it would be better to move this logic into `Partitions`.
The last partition is handled by `createLastSegmentIdWithShardSpec()`. I've moved the dedup/first/last logic into `Partitions` (which is renamed to `PartitionBoundaries`).
| /**
|  * Convenience wrapper to make code more readable.
|  */
| public class Partitions extends ForwardingList<String> implements List<String>
I think the name of the class is a bit confusing. Should it be `PartitionBoundaries`?
| new StringSketch();
| }
| catch (NoClassDefFoundError e) {
|   throw new ISE(e, "DataSketches is unvailable. Try adding the druid-datasketches extension to the classpath.");
How about "Try loading the druid-datasketches extension in the overlord and middleManagers/indexers"?
Adjusted the wording in the error message.
| ParallelIndexTuningConfig tuningConfig = ingestionSchema.getTuningConfig();

| SingleDimensionPartitionsSpec partitionsSpec = (SingleDimensionPartitionsSpec) tuningConfig.getPartitionsSpec();
| Preconditions.checkNotNull(partitionsSpec);
It would be nice to print that `partitionsSpec` in `tuningConfig` is null if it's null.
Added an error message.
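For example (a hedged sketch; the message text in the PR may differ), using the `checkNotNull` overload that takes an error message:

```java
SingleDimensionPartitionsSpec partitionsSpec = (SingleDimensionPartitionsSpec) tuningConfig.getPartitionsSpec();
Preconditions.checkNotNull(partitionsSpec, "partitionsSpec required in tuningConfig");
```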
| }
| }

| // UngroupedDimValueFilter may not accept the min/max dimensionValue. If needed, add the min/max
Please update the doc here.
Updated the class name.
| * partition key). The {@link ShardSpec} is later used by {@link PartialGenericSegmentMergeTask} to merge the partial
| * segments.
| */
| public class PartitionMetadata extends PartitionStat<ShardSpec>
Do you want to rename `PartitionStat` as well? I'm fine with doing it as a follow-up.
`GeneratedHashPartitionsReport` has a JSON field name of `partitionStats`, which needs to be preserved for backward compatibility. I think keeping the "Stat" suffix is nice for having "HashPartitionStats" match the JSON. To make things symmetric, I've renamed `PartitionMetadata` back to `GenericPartitionStats`.
| partitionBoundaries.set(0, null);

| // Last partition ends with null (see StringPartitionChunk.isEnd())
| partitionBoundaries.add(null);
Is there a reason to handle the first and last partition differently? Looks like the last partition will be `(max, null)`, which could be empty or very small.
The last partition will never be empty because it'll have at least one row with the max value.
Previously, I had logic to combine it with the second-to-last partition when it was small: #8925 (comment)
If we still don't want logic that decides whether to combine it, then it needs to either always be combined or never be combined.
My question is why the null is added at the end instead of replacing the max value, as is done for the first partition, like below:

```java
// First partition starts with null (see StringPartitionChunk.isStart())
partitionBoundaries.set(0, null);

// Last partition ends with null (see StringPartitionChunk.isEnd())
partitionBoundaries.set(partitionBoundaries.size() - 1, null);
```

What is the assumption behind handling the first partition and the last one differently?
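To make the resulting data shape concrete, here is a hedged sketch (not the PR's exact code) of how the boundaries list looks after the dedup/first/last handling discussed above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Given sorted quantiles from the merged sketch, e.g. [a, b, b, c, max]:
//   1) dedup              -> [a, b, c, max]
//   2) set first to null  -> [null, b, c, max]       (first range unbounded below)
//   3) append null        -> [null, b, c, max, null] (last range [max, null) keeps
//                                                     rows equal to the max value)
class PartitionBoundariesExample
{
  static List<String> build(String[] sortedQuantiles)
  {
    List<String> boundaries = Arrays.stream(sortedQuantiles)
                                    .distinct()
                                    .collect(Collectors.toCollection(ArrayList::new));
    boundaries.set(0, null);
    boundaries.add(null);
    return boundaries;
  }
}
```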
| @Test(groups = TestNGGroup.BATCH_INDEX)
| @Guice(moduleFactory = DruidTestModuleFactory.class)
| - public class ITParallelIndexTest extends AbstractITBatchIndexTest
| + public class ITImperfectRollupParallelIndexTest extends AbstractITBatchIndexTest
`ITBestEffortRollupParallelIndexTest`? (https://druid.apache.org/docs/latest/ingestion/index.html#best-effort-rollup)
| Iterables.getOnlyElement(inputRow.getDimension(partitionDimension))
| );

| if (dimensionValue != null) {
What happens here if the row actually contained a null for the dimension? Does this need to distinguish that case?
I can change the behavior of `DimensionValueFilter.accept()` to handle that case.
| TaskState distributionState = runNextPhase(distributionRunner);
| if (distributionState.isFailure()) {
|   return TaskStatus.failure(getId());
Suggest adding an error message to the failure status indicating which phase failed.
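A hedged sketch of that suggestion; it assumes a `TaskStatus.failure(taskId, errorMsg)` overload (or equivalent) for carrying a reason, and the phase name string is illustrative:

```java
TaskState distributionState = runNextPhase(distributionRunner);
if (distributionState.isFailure()) {
  return TaskStatus.failure(
      getId(),
      StringUtils.format("Failed in phase[%s]. See task logs for details.", "dimension distribution")
  );
}
```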
jon-wei left a comment

The design LGTM, didn't completely review the code.
Description

Implements single dimension range partitioning for native parallel batch indexing as described in #8769. This initial version requires the druid-datasketches extension to be added to the classpath.

This PR has:
Key changed/added classes in this PR
- `PartialDimensionDistributionTask` and `PartialDimensionDistributionTaskTest`
- `RangePartitionCachingLocalSegmentAllocator` and `RangePartitionCachingLocalSegmentAllocatorTest`
- `RangePartitionMultiPhaseParallelIndexingTest`