Use PartitionsSpec for all task types #8141
Conversation
clintropolis left a comment
this seems like a nice change 👍
```java
/**
 * Returns true if this partitionsSpec needs to determine the number of partitions to start data ingestion.
 */
@JsonIgnore
boolean isDeterminingPartitions();
```
```java
public IndexTuningConfig withMaxRowsPerSegment(int maxRowsPerSegment)
public IndexTuningConfig withPartitionsSpec(PartitionsSpec partitionsSpec)
```
```diff
 if (addResult.isOk()) {
-  if (addResult.isPushRequired(tuningConfig)) {
+  final boolean isPushRequired =
```
Any reason not to push this down into the AppenderatorDriverAddResult like it was previously? Is the other AppenderatorDriverAddResult.isPushRequired method still legitimately used?
Hmm, yeah it looks better to use the existing one. Reverted to use it.
```diff
 // If the number of rows in the segment exceeds the threshold after adding a row,
 // move the segment out from the active segments of BaseAppenderatorDriver to make a new segment.
-if (addResult.isPushRequired(tuningConfig) && !sequenceToUse.isCheckpointed()) {
+final boolean isPushRequired =
```
same thing re isPushRequired
```java
this.partitionDimensions = partitionDimensions == null ? DEFAULT_PARTITION_DIMENSIONS : partitionDimensions;
Preconditions.checkArgument(
    PartitionsSpec.isEffectivelyNull(maxRowsPerSegment) || PartitionsSpec.isEffectivelyNull(numShards),
    "Can't use maxRowsPerSegment and numShards together"
);
```
When this is called through HadoopHashedPartitionsSpec, the field there is called targetPartitionSize instead of maxRowsPerSegment, so it might be clearer to indicate that name here as well in addition to maxRowsPerSegment
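A self-contained sketch of what a clearer error message could look like. The helper names and the "not set" semantics of `isEffectivelyNull` are illustrative stand-ins, not Druid's actual implementation:

```java
public class PartitionsSpecCheck {
    // Stand-in for PartitionsSpec.isEffectivelyNull: treat null or non-positive as "not set".
    // (The real semantics may differ; this is just for illustration.)
    static boolean isEffectivelyNull(Integer val) {
        return val == null || val <= 0;
    }

    // Hypothetical check whose message mentions both the common name and the
    // Hadoop-side name, so HadoopHashedPartitionsSpec users can recognize it.
    static void checkMaxRowsVsNumShards(Integer maxRowsPerSegment, Integer numShards) {
        if (!(isEffectivelyNull(maxRowsPerSegment) || isEffectivelyNull(numShards))) {
            throw new IllegalArgumentException(
                "Can't use maxRowsPerSegment (targetPartitionSize in Hadoop tasks) and numShards together"
            );
        }
    }

    public static void main(String[] args) {
        checkMaxRowsVsNumShards(1000, null); // ok: only one of the two is set
        try {
            checkMaxRowsVsNumShards(1000, 4); // both set: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```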
```java
    PartitionsSpec.isEffectivelyNull(maxRowsPerSegment) || PartitionsSpec.isEffectivelyNull(numShards),
    "Can't use maxRowsPerSegment and numShards together"
);
// Needs to determine partitions if the _given_ numShards is null
```
```java
Preconditions.checkArgument(maxRowsPerSegment > 0, "maxRowsPerSegment must be specified");
this.maxRowsPerSegment = maxRowsPerSegment;
this.maxPartitionSize = PartitionsSpec.isEffectivelyNull(maxPartitionSize)
    ? Math.multiplyExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5))
```
Is this calculation right? `maxPartitionSize` previously defaulted to 50% more than `targetPartitionSize`, but this computes `targetPartitionSize * (targetPartitionSize * 0.5)`.
Nice finding! This should be Math.addExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5)). Thanks.
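To make the difference concrete, here is a small runnable demonstration of the two formulas discussed above. The method names are ours, not Druid's; only the two expressions come from the review thread:

```java
public class MaxPartitionSizeDefault {
    // The original expression: maxRowsPerSegment multiplied by half of itself.
    // For large values (e.g. 5,000,000) multiplyExact would even overflow int and throw.
    static int buggyDefault(int maxRowsPerSegment) {
        return Math.multiplyExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5));
    }

    // The suggested fix: 50% more than maxRowsPerSegment, matching the old
    // behavior where maxPartitionSize defaulted to 1.5x targetPartitionSize.
    static int fixedDefault(int maxRowsPerSegment) {
        return Math.addExact(maxRowsPerSegment, (int) (maxRowsPerSegment * 0.5));
    }

    public static void main(String[] args) {
        System.out.println(fixedDefault(1000)); // 1500 (1000 + 500)
        System.out.println(buggyDefault(1000)); // 500000 (1000 * 500)
    }
}
```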
```java
} else {
  if (forceGuaranteedRollup) {
    if (!(partitionsSpec instanceof HashedPartitionsSpec)) {
      throw new ISE("HashedPartitionsSpec must be used for perfect rollup");
```
👍 after conflict resolved

@dclim @clintropolis thank you for the review!
Part of #8061.
Description
`PartitionsSpec` is a class to describe the secondary partitioning method for data ingestion, but it is currently used only by Hadoop tasks. For more consistent behavior and configuration, all task types should use the same `PartitionsSpec`.

- `PartitionsSpec` is the top-level interface and has one direct implementation, `DynamicPartitionsSpec`.
- `DynamicPartitionsSpec` is the new partitionsSpec used by indexTask and the Kafka/Kinesis index tasks.
- `DimensionBasedPartitionsSpec` is a child interface of `PartitionsSpec` and represents partitioning based on dimension values. It has two implementations, `HashedPartitionsSpec` and `SingleDimensionPartitionsSpec`. These partitionsSpecs are used if and only if perfect rollup is configured.

This PR is backward-incompatible for tasks which use `IndexTuningConfig` (indexTask, compactionTask, and parallelIndexTask) because the JSON form of tuningConfig no longer has `maxRowsPerSegment`, `maxTotalRows`, `numShards`, and `partitionDimensions`. However, the old JSON format can still be read. It should be compatible for other task types.

This PR has:
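For illustration, a hypothetical tuningConfig fragment showing how the new nested `partitionsSpec` field might look after this change. The exact field names and defaults are assumptions based on the description above, not taken from the PR's diff:

```json
{
  "type": "index_parallel",
  "forceGuaranteedRollup": true,
  "partitionsSpec": {
    "type": "hashed",
    "numShards": 10,
    "partitionDimensions": ["dim1"]
  }
}
```

With `forceGuaranteedRollup` set, only a `DimensionBasedPartitionsSpec` such as `hashed` would be accepted; without it, a `dynamic` partitionsSpec would take the place of the old top-level `maxRowsPerSegment` and `maxTotalRows` fields.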