Automatically determine numShards for parallel ingestion hash partitioning #10419

Merged
jon-wei merged 9 commits into apache:master from jon-wei:numshards on Sep 24, 2020

Conversation

@jon-wei (Contributor) commented Sep 22, 2020

This PR allows parallel batch ingestion to automatically determine numShards when forceGuaranteedRollup is used with hash partitioning.

This is accomplished with a new phase of subtasks (PartialDimensionCardinalityTask). These subtasks build a Map<Interval, HyperLogLogCollector>, where each HLL collector records the cardinality of the partitioning dimensions for one segment granularity interval in the input data.

The supervisor task aggregates the HLL collectors by interval and determines the highest cardinality across all the intervals. This max cardinality is divided by maxRowsPerSegment to automatically determine numShards.
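
For illustration, the supervisor-side computation looks roughly like this (a minimal sketch with assumed names, not this PR's exact code; a review thread below switches the collectors to DataSketches' HllSketch):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.druid.hll.HyperLogLogCollector;
import org.joda.time.Interval;

public class NumShardsEstimator
{
  public static int estimateNumShards(
      List<Map<Interval, HyperLogLogCollector>> subtaskReports,
      int maxRowsPerSegment
  )
  {
    // Fold the per-subtask collectors into a single collector per interval.
    Map<Interval, HyperLogLogCollector> merged = new HashMap<>();
    for (Map<Interval, HyperLogLogCollector> report : subtaskReports) {
      report.forEach(
          (interval, collector) ->
              merged.computeIfAbsent(interval, i -> HyperLogLogCollector.makeLatestCollector())
                    .fold(collector)
      );
    }

    // Size shards from the interval with the highest partition-dimension cardinality.
    double maxCardinality = 0;
    for (HyperLogLogCollector collector : merged.values()) {
      maxCardinality = Math.max(maxCardinality, collector.estimateCardinality());
    }

    // Always create at least one shard, even for empty input.
    return Math.max(1, (int) Math.ceil(maxCardinality / maxRowsPerSegment));
  }
}
```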

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.


  @VisibleForTesting
- static List<Object> getGroupKey(final List<String> partitionDimensions, final long timestamp, final InputRow inputRow)
+ public static List<Object> getGroupKey(final List<String> partitionDimensions, final long timestamp, final InputRow inputRow)
Contributor

Does it make sense to keep the @VisibleForTesting since this is now public (so that it can be used in PartialDimensionCardinalityTask)?

Contributor Author

Removed the @VisibleForTesting

if (!(ingestionSchema.getTuningConfig().getPartitionsSpec() instanceof HashedPartitionsSpec)) {
  // only range and hash partitioning is supported for multiphase parallel ingestion, see runMultiPhaseParallel()
  throw new ISE(
      "forceGuaranteedRollup is set but partitionsSpec [%s] is not a ranged or hash partition spec.",
Contributor

Using "single_dim" instead of "ranged" in the error message maps better to the naming in the docs

Contributor Author

Changed the message to use "single_dim"

//noinspection OptionalGetWithoutIsPresent (InputRowIterator returns rows with present intervals)
Interval interval = granularitySpec.bucketInterval(timestamp).get();

LOG.info("TS: " + timestamp + " INTV: " + interval + " GSC: " + granularitySpec.getClass());
Contributor

What do you think about changing the logging level to reduce the logging amount since this is printed for each row?

Contributor Author

I'll remove this; it was a debugging-only message that shouldn't be retained.

Contributor Author

This has been removed


LOG.info("TS: " + timestamp + " INTV: " + interval + " GSC: " + granularitySpec.getClass());

HyperLogLogCollector hllCollector = intervalToCardinalities.computeIfAbsent(
Contributor

What do you think about using HllSketch instead of HyperLogLogCollector, since HllSketch provides much more accurate estimates when the cardinality does not exceed the sketch's k value: http://datasketches.apache.org/docs/HLL/HllSketchVsDruidHyperLogLogCollector.html. Using HllSketch will also be more accurate and faster when the partial HLLs are merged.

Using HllSketch would mean that the implementation for parallel ingestion is different from the one for sequential ingestion though.

Contributor Author

> Using HllSketch would mean that the implementation for parallel ingestion is different from the one for sequential ingestion though.

That was my reasoning for using HyperLogLogCollector, but I think it makes sense to change these to use HllSketch. I'll update this one to use HllSketch and a follow-on could be to do the same for IndexTask.

Contributor Author

I've updated this to use HllSketch instead
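
For reference, the HllSketch update-and-merge pattern that replaces the collector folding looks roughly like this (a self-contained sketch; the lgK value and names are assumptions, not the PR's settings):

```java
import org.apache.datasketches.hll.HllSketch;
import org.apache.datasketches.hll.Union;

public class HllSketchMergeDemo
{
  private static final int LG_K = 11; // 2^11 buckets; an assumed size

  public static void main(String[] args)
  {
    // Two "subtask" sketches with overlapping keys.
    HllSketch sketchA = new HllSketch(LG_K);
    HllSketch sketchB = new HllSketch(LG_K);
    for (int i = 0; i < 1000; i++) {
      sketchA.update("key-" + i);         // keys 0..999
      sketchB.update("key-" + (i + 500)); // keys 500..1499
    }

    // The supervisor-side merge: a Union deduplicates across sketches.
    Union union = new Union(LG_K);
    union.update(sketchA);
    union.update(sketchB);

    System.out.println("estimated cardinality: " + union.getResult().getEstimate()); // ~1500
  }
}
```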

Comment on lines +127 to +133
public boolean isReady(TaskActionClient taskActionClient) throws Exception
{
  return tryTimeChunkLock(
      taskActionClient,
      getIngestionSchema().getDataSchema().getGranularitySpec().inputIntervals()
  );
}
Contributor

I believe the logic in PartialHashSegmentGenerateTask.isReady() will need to be adjusted; otherwise if PartialDimensionCardinalityTask runs and grabs the locks here it'll get stuck. For example, PartialRangeSegmentGenerateTask.isReady() does not grab the locks since they're acquired earlier by PartialDimensionDistributionTask.

I think a good place to add a test would be in HashPartitionMultiPhaseParallelIndexingTest. Currently, its test cases all specify a value of 2 for numShards, so we can add cases with null.

Contributor Author

Thanks, will look into this

Contributor Author

I didn't see any failures when I was testing this on a cluster or in unit tests with the current locking (maybe related to the javadoc comment on isReady(): "This method must be idempotent, as it may be run multiple times per task."?).

I updated PartialHashSegmentGenerateTask.isReady() to skip the lock acquisition if the numShardsOverride is set (indicating that the cardinality phase ran).

Contributor

I think acquiring locks here should be fine since it's idempotent. The supervisor task and all its subtasks share the same lock based on their groupId. I actually think it's better to call tryTimeChunkLock() in every subtask since it will make the task fail early when its lock is revoked. Otherwise, the task will fail when it publishes segments which happens at the last stage in batch ingestion.

Contributor

I seem to remember running into the self deadlock issue when implementing #8925, which is why the locks are only acquired in the first phase of the range partitioning subtasks. I don't remember if I discovered this via RangePartitionMultiPhaseParallelIndexingTest or ITPerfectRollupParallelIndexTest, but if the problem isn't showing up in this PR's tests, then that's good.

Contributor

I wasn't able to reproduce the self deadlock with RangePartitionMultiPhaseParallelIndexingTest or ITPerfectRollupParallelIndexTest in master, so either I misremembered stuff or the issue has been fixed in the meantime.

  public String getForceGuaranteedRollupIncompatiblityReason()
  {
-   return getNumShards() == null ? NUM_SHARDS + " must be specified" : FORCE_GUARANTEED_ROLLUP_COMPATIBLE;
+   return FORCE_GUARANTEED_ROLLUP_COMPATIBLE;
@ccaominh (Contributor) commented Sep 22, 2020

https://github.com/apache/druid/blob/master/docs/ingestion/native-batch.md#hash-based-partitioning needs to be updated to say that numShards is no longer required and also to mention the new partial dimension cardinality task.

Contributor Author

Updated the docs to mark numShards as optional and added a description of the new cardinality scan phase.


final Integer numShardsOverride;
HashedPartitionsSpec partitionsSpec = (HashedPartitionsSpec) ingestionSchema.getTuningConfig().getPartitionsSpec();
if (ingestionSchema.getTuningConfig().isForceGuaranteedRollup() && partitionsSpec.getNumShards() == null) {
Contributor

Checking isForceGuaranteedRollup() is probably redundant here

Contributor Author

Removed the isForceGuaranteedRollup check here

import java.util.Map;

@RunWith(Enclosed.class)
public class PartialDimensionCardinalityTaskTest
Contributor

I like the extensive set of tests!

});

// determine the highest cardinality in any interval
long maxCardinality = Long.MIN_VALUE;
Contributor

how about using 0 instead here?

Contributor Author

Changed this to 0
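
The loop then reads roughly as follows (a sketch with assumed names). Starting from 0 also behaves sanely if the interval map is empty, whereas Long.MIN_VALUE would propagate a nonsense value into the shard count:

```java
// determine the highest cardinality in any interval; 0 is a safe floor for an empty map
long maxCardinality = 0;
for (HllSketch sketch : intervalToCardinalities.values()) {
  maxCardinality = Math.max(maxCardinality, (long) Math.ceil(sketch.getEstimate()));
}
```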


try {
  hllSketch.update(
      IndexTask.HASH_FUNCTION.hashBytes(jsonMapper.writeValueAsBytes(groupKey)).asBytes()
Contributor

Out of curiosity, what is the effect if we pass jsonMapper.writeValueAsBytes(groupKey) directly to hllSketch? Does it affect performance or accuracy in any way? hllSketch is already doing hashing on the byte array input.

Contributor Author

Good point, I got rid of the first level of hashing there; it should be more accurate this way.
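
The updated call then passes the serialized key straight through (a sketch assuming the surrounding try/catch shape stays in place):

```java
try {
  // HllSketch hashes its byte[] input internally, so the serialized group key
  // can be passed in directly; pre-hashing just stacks a second hash on top.
  hllSketch.update(jsonMapper.writeValueAsBytes(groupKey));
}
catch (JsonProcessingException jpe) {
  throw new RuntimeException(jpe);
}
```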

Comment on docs/ingestion/native-batch.md (outdated)
 |property|description|default|required?|
 |--------|-----------|-------|---------|
 |type|This should always be `hashed`|none|yes|
-|numShards|Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data.|null|yes|
+|numShards|Directly specify the number of shards to create. If this is specified and `intervals` is specified in the `granularitySpec`, the index task can skip the determine intervals/partitions pass through the data.|null|no|
Contributor

Please add the new parameter and its description.

Contributor Author

I don't think I added any new parameters to HashedPartitionsSpec, were you referring to something else?

Contributor

Oh, I should have been more specific. I meant the newly supported parameter, maxRowsPerSegment (or targetRowsPerSegment, which I suggested below). Neither of them is described here since we didn't support them before. Maybe https://github.com/apache/druid/blob/master/docs/ingestion/hadoop.md#hash-based-partitioning helps.

Contributor Author

Added docs for targetRowsPerSegment

Comment on docs/ingestion/native-batch.md (outdated)
The Parallel task splits the input data and assigns them to worker tasks based on the split hint spec.
Each worker task (type `partial_dimension_cardinality`) gathers estimates of partitioning dimensions cardinality for
each time chunk. The Parallel task will aggregate these estimates from the worker tasks and determine the highest
cardinality across all of the time chunks in the input data, dividing this cardinality by `maxRowsPerSegment` to
Contributor

It seems like this will guarantee that you will not have more segments than the computed numShards, but it will not guarantee that the number of rows per segment doesn't exceed maxRowsPerSegment, since the partition dimensions can be skewed. Is this correct? If so, I suggest targetRowsPerSegment since it's not a hard limit.

Contributor Author

I went with targetRowsPerSegment

  return Math.toIntExact(numShards);
}
catch (ArithmeticException ae) {
  return Integer.MAX_VALUE;
Contributor

Hmm.. Should we fail instead? Since the timeline in the coordinator and the broker will explode if you have this many segments per interval.

Contributor Author

I changed this to throw an exception now
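
The revised overflow handling is roughly the following (the helper shape and message are assumptions, not the exact PR code):

```java
import org.apache.druid.java.util.common.ISE;

static int computedNumShardsToInt(long numShards)
{
  try {
    return Math.toIntExact(numShards);
  }
  catch (ArithmeticException ae) {
    // Silently clamping to Integer.MAX_VALUE would create enough segments per
    // interval to overwhelm the coordinator and broker timelines, so fail instead.
    throw new ISE("numShards [%d] exceeds Integer.MAX_VALUE; raise targetRowsPerSegment or set numShards explicitly", numShards);
  }
}
```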

Comment on docs/ingestion/native-batch.md (outdated)

 The Parallel task with hash-based partitioning is similar to [MapReduce](https://en.wikipedia.org/wiki/MapReduce).
-The task runs in 2 phases, i.e., `partial segment generation` and `partial segment merge`.
+The task runs in up to 3 phases: `partial_dimension_cardinality`, `partial segment generation` and `partial segment merge`.
Contributor

When I wrote this, my intention was to show the names of the phases (e.g., https://github.com/apache/druid/pull/10419/files#diff-05bbc55d565a3d9462353d9b4771cb09R34). The phase name is not shown anywhere currently, but will be available in the task live reports and metrics after #10352. How about removing the underscores from here too?

Contributor Author

Removed the underscores here

);
List<Object> groupKey = HashBasedNumberedShardSpec.getGroupKey(
    partitionDimensions,
    interval.getStartMillis(),
Contributor

Contributor Author

Ah, thanks, I fixed this and updated PartialDimensionCardinalityTaskTest.sendsCorrectReportWithMultipleIntervalsInData() to test that case

Map<Interval, HllSketch> intervalToCardinalities = new HashMap<>();
while (inputRowIterator.hasNext()) {
  InputRow inputRow = inputRowIterator.next();
  if (inputRow == null) {
Contributor

nit: inputRow is safely non-null since FilteringCloseableInputRowIterator filters out all null rows. See AbstractBatchIndexTask.defaultRowFilter().

Contributor Author

Removed the null check

    parseExceptionHandler
);
HandlingInputRowIterator iterator =
    new DefaultIndexTaskInputRowIteratorBuilder()
Contributor

nit: DefaultIndexTaskInputRowIteratorBuilder effectively does nothing here since its core functionality has been moved to FilteringCloseableInputRowIterator in #10336. I haven't cleaned up this interface yet. Now, it's only useful in range partitioning as some more useful inputRowHandlers are appended to the default builder.

Contributor Author

I changed this to just use the iterator above directly.

@jon-wei (Contributor Author) commented Sep 24, 2020

I added the locking back on PartialHashSegmentGenerateTask to address #10419 (comment)

@jihoonson (Contributor) left a comment

LGTM. Thanks @jon-wei!

@jihoonson (Contributor)

One thing to note: after this PR, the parallel task can compute the number of partitions automatically, but the same number will be applied to all intervals. I think it will be better to compute and apply different numbers of partitions per interval to handle potential data skew between intervals which is what the simple task (IndexTask) does. But I'm OK with doing this improvement as a follow-up.

@jon-wei merged commit cb30b1f into apache:master on Sep 24, 2020
ccaominh added a commit to ccaominh/druid that referenced this pull request Sep 26, 2020
Specifying `numShards` for hashed partitions is no longer required after
apache#10419. Update the UI to make
`numShards` an optional field for hash partitions.
vogievetsky pushed a commit that referenced this pull request Sep 29, 2020
* Compaction config UI optional numShards

Specifying `numShards` for hashed partitions is no longer required after
#10419. Update the UI to make
`numShards` an optional field for hash partitions.

* Update snapshot
@jon-wei added this to the 0.20.0 milestone on Sep 30, 2020
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
Automatically determine numShards for parallel ingestion hash partitioning (apache#10419)

* Automatically determine numShards for parallel ingestion hash partitioning

* Fix inspection, tests, coverage

* Docs and some PR comments

* Adjust locking

* Use HllSketch instead of HyperLogLogCollector

* Fix tests

* Address some PR comments

* Fix granularity bug

* Small doc fix
JulianJaffePinterest pushed a commit to JulianJaffePinterest/druid that referenced this pull request Jan 22, 2021
* Compaction config UI optional numShards

Specifying `numShards` for hashed partitions is no longer required after
apache#10419. Update the UI to make
`numShards` an optional field for hash partitions.

* Update snapshot