Use targetRowsPerSegment for single-dim partitions#8624
clintropolis merged 1 commit into apache:master
Conversation
When using single-dimension partitioning, use targetRowsPerSegment (if specified) to size segments. Previously, single-dimension partitioning would always size segments as close to the max size as possible. Also, change single-dimension partitioning to allow partitions that have a size equal to the target or max size. Previously, it would create partitions up to 1 less than those limits. Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.
  public int getTargetPartitionSize()
  {
-   final Integer targetPartitionSize = schema.getTuningConfig().getPartitionsSpec().getMaxRowsPerSegment();
+   DimensionBasedPartitionsSpec spec = schema.getTuningConfig().getPartitionsSpec();
This is the only functional change in this file
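For readers skimming the diff, here is a minimal self-contained sketch of the sizing rule this change enables, per the PR description (the method and accessor names are illustrative, not the actual Druid API): prefer targetRowsPerSegment when the user specifies it, otherwise fall back to maxRowsPerSegment.

class PartitionSizing
{
  // Illustrative sketch: prefer the user-specified target, otherwise
  // fall back to the max, which matches the old behavior.
  static int resolveTargetPartitionSize(Integer targetRowsPerSegment, Integer maxRowsPerSegment)
  {
    if (targetRowsPerSegment != null) {
      return targetRowsPerSegment;
    }
    // assumes maxRowsPerSegment is non-null when no target is given
    return maxRowsPerSegment;
  }
}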
| "2014102200,h.example.com,US,251", | ||
| "2014102200,i.example.com,US,963", | ||
| "2014102200,j.example.com,US,333", | ||
| "2014102200,k.example.com,US,555" |
The behavior of single-dim partitioning was changed to size partitions up to the max size instead of up to 1 less.
Before, this test would create partitions of sizes [5, 5, 1], which would be merged into [5, 6], so both would be within the max size of 6.
After the change to size up to the max size, this test creates partitions of sizes [5, 5, 1], which would be merged into [5, 6] and then rejected, since the last partition exceeds the max size of 5. The last row was therefore removed from the input data so that the partition sizes would be [5, 5].
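To make the arithmetic concrete, here is a small self-contained sketch of the merge-then-validate step described above (an illustration of the test's expectation, not the actual DeterminePartitionsJob code; the real merge heuristic may differ):

import java.util.ArrayList;
import java.util.List;

class TrailingPartitionMerge
{
  // Fold an undersized trailing partition into its predecessor, then
  // reject the plan if any resulting partition exceeds the max.
  static List<Integer> mergeAndValidate(List<Integer> sizes, int target, int max)
  {
    List<Integer> merged = new ArrayList<>(sizes);
    int last = merged.size() - 1;
    if (last > 0 && merged.get(last) < target) {
      merged.set(last - 1, merged.get(last - 1) + merged.get(last));
      merged.remove(last);
    }
    for (int size : merged) {
      if (size > max) {
        throw new IllegalStateException("partition of " + size + " rows exceeds max " + max);
      }
    }
    return merged;
  }
}

With sizes [5, 5, 1], merging yields [5, 6]: that passes when the max is 6 but throws when the max is 5, which is why the last input row was dropped.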
clintropolis left a comment:
overall lgtm, +1 after explanation for behavior change
  // See if we need to cut a new partition ending immediately before this dimension value
- if (currentDimPartition.rows > 0 && currentDimPartition.rows + dvc.numRows >= config.getTargetPartitionSize()) {
+ if (currentDimPartition.rows > 0 && currentDimPartition.rows + dvc.numRows > config.getTargetPartitionSize()) {
What is the justification for changing behavior that has existed for over 6 years? I didn't see anything in the PR description that answers the 'why', just a mention that it was changed. Could you please update the PR description with the answer? It doesn't seem to make much difference and I have no opinion either way; I'm just curious what motivated it.
When I was adding tests, I was expecting partition sizes to match the expected target, but they were sized to one less than the target because of this line. In practice, since partition sizes are much larger than 1, the difference is negligible, so I'm OK with reverting this change too.
Related to your suggestion, I'll update the PR description to explain why single-dim partitioning is being changed to use targetRowsPerSegment.
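For anyone else reading along, here is a tiny self-contained example of the off-by-one being discussed (the values are illustrative):

class CutBoundaryExample
{
  public static void main(String[] args)
  {
    int target = 5;        // config.getTargetPartitionSize()
    int currentRows = 4;   // currentDimPartition.rows
    int incomingRows = 1;  // dvc.numRows

    // Old condition (>=): cuts before this value, so partitions top out at target - 1.
    boolean cutOld = currentRows > 0 && currentRows + incomingRows >= target; // true

    // New condition (>): keeps the value, so a partition can reach exactly the target.
    boolean cutNew = currentRows > 0 && currentRows + incomingRows > target;  // false

    System.out.println("old cuts: " + cutOld + ", new cuts: " + cutNew);
  }
}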
Ah, I don't think it needs to be reverted; I just wanted to know the motivation. It is probably best to look at how similar row-oriented indexing limits are handled in other indexing types, to make sure the behavior is consistent everywhere if possible.
Looks like DynamicPartitionsSpec.maxRowsPerSegment and maxTotalRows allow sizes that equal the limit:
https://github.com/apache/incubator-druid/blob/master/server/src/main/java/org/apache/druid/segment/realtime/appenderator/AppenderatorDriverAddResult.java#L103
https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/SinglePhaseSubTask.java#L462
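Paraphrasing those two call sites (a sketch of the semantics, not the linked code): the dynamic checks only consider a segment full once the count reaches the limit, so a segment holding exactly maxRowsPerSegment rows is legal.

class DynamicLimitSketch
{
  // A segment is considered full (and gets pushed) once it reaches the
  // limit, meaning a size exactly equal to maxRowsPerSegment is allowed.
  static boolean isPushRequired(long numRowsInSegment, long maxRowsPerSegment)
  {
    return numRowsInSegment >= maxRowsPerSegment;
  }
}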
Updated the PR description to mention the motivation for consistent behavior.
  public class DeterminePartitionsJobTest
  {
+   @Nullable
+   private static final Long NO_TARGET_ROWS_PER_SEGMENT = null;
nit: this seems less clear to me than just using null inline, especially since IntelliJ displays parameter names for literal values by default.
The named constant can be useful for tools that do not display parameter names (e.g., diff tools like the GitHub PR diff). Perhaps the name can be improved (e.g., DEFAULT_TARGET_ROWS_PER_SEGMENT)? In this particular case, since the data is an Object[][], IntelliJ doesn't display the parameter names anyway.
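For illustration, hypothetical parameterized data showing the readability difference (not the exact test):

class ParameterNamingExample
{
  private static final Long NO_TARGET_ROWS_PER_SEGMENT = null;

  // In a raw diff, the named constant documents what the first column
  // means; the inline null requires looking up the parameter list.
  static Object[][] params()
  {
    return new Object[][]{
        {NO_TARGET_ROWS_PER_SEGMENT, 6},  // targetRowsPerSegment, maxRowsPerSegment
        {null, 6}                         // same meaning, less self-describing
    };
  }
}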
  Assert.assertEquals(maxRowsPerSegment, targetPartitionSize);
  }

  private static class HadoopIngestionSpecBuilder
  },
  ImmutableList.of(
      "2014102200,a.example.com,CN,100",
      "2014102200,b.exmaple.com,US,50",
I'll also add that if the default is changed in the future, this makes refactoring easier. I think we should generally use constants instead of literal values because they make refactoring easier and make it easier to understand why a particular value is used.
Description
When using single-dimension partitioning, use targetRowsPerSegment (if specified) to size segments. Previously, single-dimension partitioning would always size segments as close to the max size as possible.
This change restores the intended behavior that existed prior to the refactoring done in #8141:
https://github.com/apache/incubator-druid/pull/8141/files#diff-c285c81cfeec663a227ddf5e241d9effL57
https://github.com/apache/incubator-druid/pull/8141/files#diff-fd61251f2512217838c09a27d1d49a0bL335
Also, change single-dimension partitioning to allow partitions that have a size equal to the target or max size. Previously, it would create partitions up to 1 less than those limits. This change makes the behavior consistent with the row-count-based limits in DynamicPartitionsSpec.
Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.