Use PartitionsSpec for all task types #8141
Conversation
clintropolis left a comment
this seems like a nice change 👍
```java
/**
 * Returns true if this partitionsSpec needs to determine the number of partitions to start data ingestion.
 */
@JsonIgnore
boolean isDeterminingPartitions();
```
```java
public IndexTuningConfig withMaxRowsPerSegment(int maxRowsPerSegment)
public IndexTuningConfig withPartitionsSpec(PartitionsSpec partitionsSpec)
```
```diff
 if (addResult.isOk()) {
-  if (addResult.isPushRequired(tuningConfig)) {
+  final boolean isPushRequired =
```
Any reason not to push this down into the AppenderatorDriverAddResult like it was previously? Is the other AppenderatorDriverAddResult.isPushRequired method still legitimately used?
Hmm, yeah it looks better to use the existing one. Reverted to use it.
```diff
 // If the number of rows in the segment exceeds the threshold after adding a row,
 // move the segment out from the active segments of BaseAppenderatorDriver to make a new segment.
-if (addResult.isPushRequired(tuningConfig) && !sequenceToUse.isCheckpointed()) {
+final boolean isPushRequired =
```
same thing re isPushRequired
```java
this.partitionDimensions = partitionDimensions == null ? DEFAULT_PARTITION_DIMENSIONS : partitionDimensions;
Preconditions.checkArgument(
    PartitionsSpec.isEffectivelyNull(maxRowsPerSegment) || PartitionsSpec.isEffectivelyNull(numShards),
    "Can't use maxRowsPerSegment and numShards together"
);
```
When this is called through HadoopHashedPartitionsSpec, the field there is called targetPartitionSize instead of maxRowsPerSegment, so it might be clearer to indicate that name here as well in addition to maxRowsPerSegment
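A self-contained sketch of what a clearer error message could look like. The helper names and the "not set" semantics of `isEffectivelyNull` are illustrative stand-ins, not Druid's actual implementation:

```java
public class PartitionsSpecCheck {
    // Stand-in for PartitionsSpec.isEffectivelyNull: treat null or non-positive as "not set".
    // (The real semantics may differ; this is just for illustration.)
    static boolean isEffectivelyNull(Integer val) {
        return val == null || val <= 0;
    }

    // Hypothetical check whose message mentions both the common name and the
    // Hadoop-side name, so HadoopHashedPartitionsSpec users can recognize it.
    static void checkMaxRowsVsNumShards(Integer maxRowsPerSegment, Integer numShards) {
        if (!(isEffectivelyNull(maxRowsPerSegment) || isEffectivelyNull(numShards))) {
            throw new IllegalArgumentException(
                "Can't use maxRowsPerSegment (targetPartitionSize in Hadoop tasks) and numShards together"
            );
        }
    }

    public static void main(String[] args) {
        checkMaxRowsVsNumShards(1000, null); // ok: only one of the two is set
        try {
            checkMaxRowsVsNumShards(1000, 4); // both set: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```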
```java
    PartitionsSpec.isEffectivelyNull(maxRowsPerSegment) || PartitionsSpec.isEffectivelyNull(numShards),
    "Can't use maxRowsPerSegment and numShards together"
);
// Needs to determine partitions if the _given_ numShards is null
```
```java
Preconditions.checkArgument(maxRowsPerSegment > 0, "maxRowsPerSegment must be specified");
this.maxRowsPerSegment = maxRowsPerSegment;
this.maxPartitionSize = PartitionsSpec.isEffectivelyNull(maxPartitionSize)
    ? Math.multiplyExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5))
```
Is this calculation right? `maxPartitionSize` previously defaulted to 50% more than `targetPartitionSize`, but this computes `targetPartitionSize * (targetPartitionSize * 0.5)`.
Nice finding! This should be Math.addExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5)). Thanks.
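To make the difference concrete, here is a small runnable demonstration of the two formulas discussed above. The method names are ours, not Druid's; only the two expressions come from the review thread:

```java
public class MaxPartitionSizeDefault {
    // The original expression: maxRowsPerSegment multiplied by half of itself.
    // For large values (e.g. 5,000,000) multiplyExact would even overflow int and throw.
    static int buggyDefault(int maxRowsPerSegment) {
        return Math.multiplyExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5));
    }

    // The suggested fix: 50% more than maxRowsPerSegment, matching the old
    // behavior where maxPartitionSize defaulted to 1.5x targetPartitionSize.
    static int fixedDefault(int maxRowsPerSegment) {
        return Math.addExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5));
    }

    public static void main(String[] args) {
        System.out.println(fixedDefault(1000)); // 1500 (1000 + 500)
        System.out.println(buggyDefault(1000)); // 500000 (1000 * 500)
    }
}
```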
```java
} else {
  if (forceGuaranteedRollup) {
    if (!(partitionsSpec instanceof HashedPartitionsSpec)) {
      throw new ISE("HashedPartitionsSpec must be used for perfect rollup");
```
👍 after conflict resolved

@dclim @clintropolis thank you for the review!
Part of #8061.
Description
`PartitionsSpec` is a class to describe the secondary partitioning method for data ingestion, but it is currently used only by Hadoop tasks. For more consistent behavior and configuration, all task types should use the same `PartitionsSpec`.

- `PartitionsSpec` is the top-level interface and has one direct implementation, `DynamicPartitionsSpec`.
- `DynamicPartitionsSpec` is the new partitionsSpec used by indexTask and the Kafka/Kinesis index tasks.
- `DimensionBasedPartitionsSpec` is a child interface of `PartitionsSpec` and represents partitioning based on dimension values. It has two implementations, `HashedPartitionsSpec` and `SingleDimensionPartitionsSpec`. These partitionsSpecs are used if and only if perfect rollup is configured.

This PR is backward-incompatible for tasks which use `IndexTuningConfig` (indexTask, compactionTask, and parallelIndexTask) because the JSON form of tuningConfig no longer has `maxRowsPerSegment`, `maxTotalRows`, `numShards`, and `partitionDimensions`. However, the old JSON format can still be read. It should be compatible for other task types.

This PR has:
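For illustration, a hypothetical tuningConfig fragment showing how the new nested `partitionsSpec` field might look after this change. The exact field names and defaults are assumptions based on the description above, not taken from the PR's diff:

```json
{
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 10,
    "partitionDimensions": ["dim1"]
  }
}
```

With `forceGuaranteedRollup` set, only a `DimensionBasedPartitionsSpec` such as `hashed` would be accepted; without it, a `dynamic` partitionsSpec would take the place of the old top-level `maxRowsPerSegment` and `maxTotalRows` fields.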