Use targetRowsPerSegment for single-dim partitions #8624

Merged

clintropolis merged 1 commit into apache:master from ccaominh:single-dim-use-target on Oct 17, 2019

Conversation

@ccaominh (Contributor) commented Oct 3, 2019

Description

When using single-dimension partitioning, use targetRowsPerSegment (if specified) to size segments. Previously, single-dimension partitioning would always size segments as close to the max size as possible.

This change restores the intended behavior that existed prior to the refactoring in #8141:
https://github.com/apache/incubator-druid/pull/8141/files#diff-c285c81cfeec663a227ddf5e241d9effL57
https://github.com/apache/incubator-druid/pull/8141/files#diff-fd61251f2512217838c09a27d1d49a0bL335

Also, change single-dimension partitioning to allow partitions whose size equals the target or max size. Previously, it would create partitions up to one row smaller than those limits. This change makes the behavior consistent with the row-count-based limits in DynamicPartitionsSpec.

Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.
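The sizing fallback described above can be sketched as follows. `PartitionSizeResolver` is a hypothetical stand-in for the tuning-config plumbing in `HadoopDruidIndexerConfig`, not the actual Druid code: it prefers `targetRowsPerSegment` when the user specified one, and falls back to `maxRowsPerSegment` otherwise.

```java
// Simplified sketch (not the actual Druid source) of the intended fallback:
// prefer targetRowsPerSegment when specified, otherwise use maxRowsPerSegment.
public class PartitionSizeResolver
{
  private final Integer targetRowsPerSegment;  // null when the user did not specify it
  private final Integer maxRowsPerSegment;

  public PartitionSizeResolver(Integer targetRowsPerSegment, Integer maxRowsPerSegment)
  {
    this.targetRowsPerSegment = targetRowsPerSegment;
    this.maxRowsPerSegment = maxRowsPerSegment;
  }

  public int getTargetPartitionSize()
  {
    // Before this PR, single-dim partitioning effectively ignored the target
    // and always sized segments toward the max.
    return targetRowsPerSegment != null ? targetRowsPerSegment : maxRowsPerSegment;
  }
}
```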


This PR has:

  • been self-reviewed.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths.

```diff
 public int getTargetPartitionSize()
 {
-  final Integer targetPartitionSize = schema.getTuningConfig().getPartitionsSpec().getMaxRowsPerSegment();
+  DimensionBasedPartitionsSpec spec = schema.getTuningConfig().getPartitionsSpec();
```
@ccaominh (Contributor, Author) commented:
This is the only functional change in this file

```java
"2014102200,h.example.com,US,251",
"2014102200,i.example.com,US,963",
"2014102200,j.example.com,US,333",
"2014102200,k.example.com,US,555"
```
@ccaominh (Contributor, Author) commented:
The behavior of single-dim partitioning was changed to size partitions up to the max size instead of up to one row less.

Before, this test would create partitions of sizes [5, 5, 1], which would be converted to [5, 6], so both would be under the max size of 6.

After the change, this test creates partitions of sizes [5, 5, 1], which would be converted to [5, 6] and then rejected, since the last partition exceeds the max size of 5. Thus the last row was removed from the input data so that the partition sizes would be [5, 5].
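The [5, 5, 1] → [5, 6] conversion above can be illustrated with a toy model. This is not Druid's code; the "merge the trailing partition when it is smaller than half the target" heuristic is an assumption made for illustration, as is the `withinMax` validation step.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration (not Druid's actual logic): an undersized trailing
// partition is merged into its neighbor, and the result is then checked
// against maxRowsPerSegment.
public class TrailingMergeExample
{
  // Merge the last partition into its predecessor when it is "too small"
  // (assumed heuristic: smaller than half the target), mimicking the
  // [5, 5, 1] -> [5, 6] conversion described in the comment.
  static List<Integer> mergeTrailing(List<Integer> sizes, int target)
  {
    List<Integer> merged = new ArrayList<>(sizes);
    int last = merged.size() - 1;
    if (last > 0 && merged.get(last) < target / 2) {
      merged.set(last - 1, merged.get(last - 1) + merged.get(last));
      merged.remove(last);
    }
    return merged;
  }

  // After merging, every partition must still fit under the max size;
  // with max = 6 the merged [5, 6] passes, with max = 5 it is rejected.
  static boolean withinMax(List<Integer> sizes, int max)
  {
    return sizes.stream().allMatch(s -> s <= max);
  }
}
```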

@clintropolis (Member) left a comment:
overall lgtm, +1 after explanation for behavior change


```diff
       // See if we need to cut a new partition ending immediately before this dimension value
-      if (currentDimPartition.rows > 0 && currentDimPartition.rows + dvc.numRows >= config.getTargetPartitionSize()) {
+      if (currentDimPartition.rows > 0 && currentDimPartition.rows + dvc.numRows > config.getTargetPartitionSize()) {
```
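The effect of switching `>=` to `>` can be seen in a toy model of the cutting loop. This sketch is not Druid's `DeterminePartitionsJob` code; it only demonstrates the boundary behavior: with `>`, a partition is allowed to fill exactly to the target, while with `>=` it stops one row short.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the partition-cutting loop discussed above (not Druid's
// actual code). Each input value contributes numRows rows; a new partition
// is cut before the value that would push the current partition past
// (new behavior, '>') or up to (old behavior, '>=') the target size.
public class PartitionCutter
{
  static List<Integer> cut(int[] rowCounts, int target, boolean strictlyGreater)
  {
    List<Integer> partitions = new ArrayList<>();
    int current = 0;
    for (int numRows : rowCounts) {
      boolean cutHere = strictlyGreater
          ? current > 0 && current + numRows > target   // new: size == target allowed
          : current > 0 && current + numRows >= target; // old: stops one row short
      if (cutHere) {
        partitions.add(current);
        current = 0;
      }
      current += numRows;
    }
    if (current > 0) {
      partitions.add(current);
    }
    return partitions;
  }
}
```

With six single-row values and a target of 3, the new rule yields partitions of exactly the target size, [3, 3], while the old rule cuts early and yields [2, 2, 2].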
@clintropolis (Member) commented Oct 9, 2019:
What is the justification for making this change in behavior that has existed for over 6 years? I didn't see anything in the PR description to answer the 'why', just mention that it has been changed. Could you please update the PR description as well with whatever is the answer? It doesn't seem to make much difference and I have no opinions either way, just curious what motivated this.

@ccaominh (Contributor, Author) commented:

When I was adding tests, I was expecting partition sizes to match the expected target, but they were sized to one less than the target because of this line. In practice, since the partition sizes are much larger than 1, the difference is negligible, so I'm ok with reverting this change too.

Related to your suggestion, I'll update the PR description to explain why the single-dim partitioning is being changed to use targetRowsPerSegment.

@clintropolis (Member) commented:

Ah, I don't think it needs to be reverted; I just wanted to know the motivation. It is probably best to look at how other similar row-oriented indexing limits are handled in other indexing types, to make sure the behavior is consistent everywhere if possible.

@ccaominh (Contributor, Author) commented:
Updated the PR description to mention the motivation for consistent behavior.

```java
public class DeterminePartitionsJobTest
{
  @Nullable
  private static final Long NO_TARGET_ROWS_PER_SEGMENT = null;
```
@clintropolis (Member) commented:

nit: This seems less clear to me than just using null inline, especially since IntelliJ displays parameter names for literal values by default.

@ccaominh (Contributor, Author) commented Oct 9, 2019:

The named constant can be useful for tools that do not display the parameter names (e.g., diff tools like the github PR diff). Perhaps the name can be improved (e.g., DEFAULT_TARGET_ROWS_PER_SEGMENT)? In this particular case, since it's an Object[][], IntelliJ doesn't display the parameter names.

```java
    Assert.assertEquals(maxRowsPerSegment, targetPartitionSize);
  }

  private static class HadoopIngestionSpecBuilder
```
@clintropolis (Member) commented:
👍

```java
},
ImmutableList.of(
    "2014102200,a.example.com,CN,100",
    "2014102200,b.exmaple.com,US,50",
```
@clintropolis (Member) commented:
heh

@jnaous (Contributor) commented Oct 10, 2019 via email

@clintropolis (Member) left a comment:
👍

@clintropolis clintropolis merged commit 8b2afa5 into apache:master Oct 17, 2019
@ccaominh ccaominh deleted the single-dim-use-target branch November 1, 2019 17:26
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
4 participants