Use targetRowsPerSegment for single-dim partitions#8624
clintropolis merged 1 commit into apache:master
Conversation
When using single-dimension partitioning, use targetRowsPerSegment (if specified) to size segments. Previously, single-dimension partitioning would always size segments as close to the max size as possible. Also, change single-dimension partitioning to allow partitions that have a size equal to the target or max size. Previously, it would create partitions up to 1 less than those limits. Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.
  public int getTargetPartitionSize()
  {
-   final Integer targetPartitionSize = schema.getTuningConfig().getPartitionsSpec().getMaxRowsPerSegment();
+   DimensionBasedPartitionsSpec spec = schema.getTuningConfig().getPartitionsSpec();
This is the only functional change in this file
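For readers skimming the diff, here is a minimal self-contained sketch of the sizing rule this change enables, per the PR description (the method and accessor names are illustrative, not the actual Druid API): prefer targetRowsPerSegment when the user specifies it, otherwise fall back to maxRowsPerSegment.

class PartitionSizing
{
  // Illustrative sketch: prefer the user-specified target, otherwise
  // fall back to the max, which matches the old behavior.
  static int resolveTargetPartitionSize(Integer targetRowsPerSegment, Integer maxRowsPerSegment)
  {
    if (targetRowsPerSegment != null) {
      return targetRowsPerSegment;
    }
    // assumes maxRowsPerSegment is non-null when no target is given
    return maxRowsPerSegment;
  }
}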
| "2014102200,h.example.com,US,251", | ||
| "2014102200,i.example.com,US,963", | ||
| "2014102200,j.example.com,US,333", | ||
| "2014102200,k.example.com,US,555" |
The behavior of single-dim partitioning was changed to size partitions up to the max size instead of up to 1 less.
Before, this test would create partitions of sizes [5, 5, 1], which would be merged into [5, 6], so both would be within the max size of 6.
After the change to size up to the max size, this test creates partitions of sizes [5, 5, 1], which would be merged into [5, 6] and then rejected, since the last partition exceeds the max size of 5. The last row was therefore removed from the input data so that the partition sizes would be [5, 5].
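To make the arithmetic concrete, here is a small self-contained sketch of the merge-then-validate step described above (an illustration of the test's expectation, not the actual DeterminePartitionsJob code; the real merge heuristic may differ):

import java.util.ArrayList;
import java.util.List;

class TrailingPartitionMerge
{
  // Fold an undersized trailing partition into its predecessor, then
  // reject the plan if any resulting partition exceeds the max.
  static List<Integer> mergeAndValidate(List<Integer> sizes, int target, int max)
  {
    List<Integer> merged = new ArrayList<>(sizes);
    int last = merged.size() - 1;
    if (last > 0 && merged.get(last) < target) {
      merged.set(last - 1, merged.get(last - 1) + merged.get(last));
      merged.remove(last);
    }
    for (int size : merged) {
      if (size > max) {
        throw new IllegalStateException("partition of " + size + " rows exceeds max " + max);
      }
    }
    return merged;
  }
}

With sizes [5, 5, 1], merging yields [5, 6]: that passes when the max is 6 but throws when the max is 5, which is why the last input row was dropped.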
clintropolis left a comment:
overall lgtm, +1 after explanation for behavior change
  // See if we need to cut a new partition ending immediately before this dimension value
- if (currentDimPartition.rows > 0 && currentDimPartition.rows + dvc.numRows >= config.getTargetPartitionSize()) {
+ if (currentDimPartition.rows > 0 && currentDimPartition.rows + dvc.numRows > config.getTargetPartitionSize()) {
What is the justification for changing behavior that has existed for over 6 years? I didn't see anything in the PR description that answers the 'why', just a mention that it was changed. Could you please update the PR description with the answer? It doesn't seem to make much difference and I have no opinion either way; I'm just curious what motivated it.
When I was adding tests, I was expecting partition sizes to match the expected target, but they were sized to one less than the target because of this line. In practice, since partition sizes are much larger than 1, the difference is negligible, so I'm OK with reverting this change too.
Related to your suggestion, I'll update the PR description to explain why single-dim partitioning is being changed to use targetRowsPerSegment.
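For anyone else reading along, here is a tiny self-contained example of the off-by-one being discussed (the values are illustrative):

class CutBoundaryExample
{
  public static void main(String[] args)
  {
    int target = 5;        // config.getTargetPartitionSize()
    int currentRows = 4;   // currentDimPartition.rows
    int incomingRows = 1;  // dvc.numRows

    // Old condition (>=): cuts before this value, so partitions top out at target - 1.
    boolean cutOld = currentRows > 0 && currentRows + incomingRows >= target; // true

    // New condition (>): keeps the value, so a partition can reach exactly the target.
    boolean cutNew = currentRows > 0 && currentRows + incomingRows > target;  // false

    System.out.println("old cuts: " + cutOld + ", new cuts: " + cutNew);
  }
}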
Ah, I don't think it needs to be reverted; I just wanted to know the motivation. It is probably best to look at how similar row-oriented indexing limits are handled in other indexing types, to make sure the behavior is consistent everywhere if possible.
Looks like DynamicPartitionsSpec.maxRowsPerSegment and maxTotalRows allow sizes that equal the limit:
https://github.com/apache/incubator-druid/blob/master/server/src/main/java/org/apache/druid/segment/realtime/appenderator/AppenderatorDriverAddResult.java#L103
https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/common/task/batch/parallel/SinglePhaseSubTask.java#L462
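Paraphrasing those two call sites (a sketch of the semantics, not the linked code): the dynamic checks only consider a segment full once the count reaches the limit, so a segment holding exactly maxRowsPerSegment rows is legal.

class DynamicLimitSketch
{
  // A segment is considered full (and gets pushed) once it reaches the
  // limit, meaning a size exactly equal to maxRowsPerSegment is allowed.
  static boolean isPushRequired(long numRowsInSegment, long maxRowsPerSegment)
  {
    return numRowsInSegment >= maxRowsPerSegment;
  }
}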
Updated the PR description to mention the motivation for consistent behavior.
  public class DeterminePartitionsJobTest
  {
+   @Nullable
+   private static final Long NO_TARGET_ROWS_PER_SEGMENT = null;
nit: this seems less clear to me than just using null inline, especially since IntelliJ displays parameter names for literal values by default.
The named constant can be useful for tools that do not display parameter names (e.g., diff tools like the GitHub PR diff). Perhaps the name can be improved (e.g., DEFAULT_TARGET_ROWS_PER_SEGMENT)? In this particular case, since the data is an Object[][], IntelliJ doesn't display the parameter names anyway.
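For illustration, hypothetical parameterized data showing the readability difference (not the exact test):

class ParameterNamingExample
{
  private static final Long NO_TARGET_ROWS_PER_SEGMENT = null;

  // In a raw diff, the named constant documents what the first column
  // means; the inline null requires looking up the parameter list.
  static Object[][] params()
  {
    return new Object[][]{
        {NO_TARGET_ROWS_PER_SEGMENT, 6},  // targetRowsPerSegment, maxRowsPerSegment
        {null, 6}                         // same meaning, less self-describing
    };
  }
}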
  Assert.assertEquals(maxRowsPerSegment, targetPartitionSize);
  }

  private static class HadoopIngestionSpecBuilder
  },
  ImmutableList.of(
      "2014102200,a.example.com,CN,100",
      "2014102200,b.exmaple.com,US,50",
I'll also add that if the default is changed in the future, this makes refactoring easier. I think we should generally use constants instead of literal values because they make refactoring easier and make it easier to understand why a particular value is used.
Description
When using single-dimension partitioning, use targetRowsPerSegment (if specified) to size segments. Previously, single-dimension partitioning would always size segments as close to the max size as possible.
This change restores the intended behavior that existed prior to the refactoring done in #8141:
https://github.com/apache/incubator-druid/pull/8141/files#diff-c285c81cfeec663a227ddf5e241d9effL57
https://github.com/apache/incubator-druid/pull/8141/files#diff-fd61251f2512217838c09a27d1d49a0bL335
Also, change single-dimension partitioning to allow partitions that have a size equal to the target or max size. Previously, it would create partitions up to 1 less than those limits. This change makes the behavior consistent with the row-count-based limits in DynamicPartitionsSpec.
Also, fix some IntelliJ inspection warnings in HadoopDruidIndexerConfig.