Use hash of Segment IDs instead of a list of explicit segments in auto compaction#8571
Conversation
| @PublicApi | ||
| public class SegmentUtils | ||
| { | ||
| private static final HashFunction HASH_FUNCTION = Hashing.sha256(); |
There was a problem hiding this comment.
If the hash does not need to be cryptographically secure, perhaps Hashing.murmur3_128() is a better option.
(If you change the hash function, the comment on 49 and CompactionIntervalSpec need to be updated too.)
There was a problem hiding this comment.
Hmm, why is it better?
There was a problem hiding this comment.
Cryptographically-secure hash functions are typically slower since they need to more work to achieve that property. For example, this perf test observed sha256 to be about an order of magnitude slower than murmur3: https://rusty.ozlabs.org/?p=511
|
|
||
| ### Compaction IOConfig | ||
|
|
||
| The compaction IOConfig requires to specify `inputSpec` as seen below. |
There was a problem hiding this comment.
Suggestion: to specify -> specifying
| if (ioConfig != null) { | ||
| this.ioConfig = ioConfig; | ||
| } else { | ||
| if (interval != null) { | ||
| this.ioConfig = new CompactionIOConfig(new CompactionIntervalSpec(interval, null)); | ||
| } else if (segments != null && !segments.isEmpty()) { | ||
| this.ioConfig = new CompactionIOConfig(SpecificSegmentsSpec.fromSegments(segments)); | ||
| } else { | ||
| throw new IAE("Missing ioConfig"); | ||
| } | ||
| } |
There was a problem hiding this comment.
This behavior is bit different from before (e.g., it allows both interval and segments to be not null). Perhaps it's better to enforce that exactly one of interval, segments, or ioConfig is not null, which has similar behavior to the old code.
| import java.util.Objects; | ||
|
|
||
| @JsonTypeName("compact") | ||
| public class CompactionIOConfig implements IOConfig |
There was a problem hiding this comment.
Do you want to add a javadoc for the class?
| import java.util.Objects; | ||
| import java.util.stream.Collectors; | ||
|
|
||
| public class SpecificSegmentsSpec implements CompactionInputSpec |
There was a problem hiding this comment.
Do you want to add a javadoc for the class?
There was a problem hiding this comment.
I think this class is pretty obvious.
| } | ||
|
|
||
| @JsonCreator | ||
| public SpecificSegmentsSpec(@JsonProperty("segments") List<String> segments) |
There was a problem hiding this comment.
Is there a serde test for this?
There was a problem hiding this comment.
Yes, it's in CompactionTaskTest.
| Assert.assertEquals(expected.getIoConfig(), actual.getIoConfig()); | ||
| Assert.assertEquals(expected.getDimensionsSpec(), actual.getDimensionsSpec()); | ||
| Assert.assertTrue(Arrays.equals(expected.getMetricsSpec(), actual.getMetricsSpec())); | ||
| Assert.assertArrayEquals(expected.getMetricsSpec(), actual.getMetricsSpec()); |
|
|
||
| import java.util.Objects; | ||
|
|
||
| public class ClientCompactionIOConfig |
There was a problem hiding this comment.
Do you want to add a javadoc for the class?
| /** | ||
| * Specifying an interval to compact. A hash of the segment IDs can be optionally provided for segment validation. | ||
| */ | ||
| public class CompactionIntervalSpec implements CompactionInputSpec |
There was a problem hiding this comment.
This is almost identical to ClientCompactionIntervalSpec. Is there a way to reuse the code?
There was a problem hiding this comment.
They are in different modules. Added javadoc.
| private final CompactionInputSpec inputSpec; | ||
|
|
||
| @JsonCreator | ||
| public CompactionIOConfig(@JsonProperty("inputSpec") CompactionInputSpec inputSpec) |
There was a problem hiding this comment.
Is there a serde test for this?
There was a problem hiding this comment.
Yes, it's in CompactionTaskTest.
|
@ccaominh thank you for the review! Addressed most of them and left some questions. |
| ### Compaction IOConfig | ||
|
|
||
| The compaction IOConfig requires to specify `inputSpec` as seen below. | ||
| The compaction IOConfig requires to specifying `inputSpec` as seen below. |
There was a problem hiding this comment.
typo: to specifying -> specifying
| */ | ||
|
|
||
| package org.apache.druid.indexer.partitions; | ||
| package org.apache.druid.indexer; |
There was a problem hiding this comment.
Package for ChecksTest needs to be updated similarly
| class Checks | ||
| public final class Checks | ||
| { | ||
| public static <T> Property<T> checkOneNotNullOrEmpty(List<Property<T>> properties) |
There was a problem hiding this comment.
Please add relevant tests to ChecksTest
| /** | ||
| * InputSpec for {@link ClientCompactionIOConfig}. | ||
| * | ||
| * Should be synchronized with org.apache.druid.indexing.common.task.CompactionIntervalSpec. |
|
|
||
| @Nullable | ||
| @JsonProperty | ||
| public String getSha256OfSortedSegmentIds() |
There was a problem hiding this comment.
nit: not personally super into this name, i feel like it makes me unnecessarily care about how the segment ids were hashed, but doesn't really matter i guess since this isn't so much directly used or directly constructed by users.
Description
Currently when the coordinator issues a compaction task in auto compaction, it specifies a list of segments to compact explicitly. The list of segments is used to validate the given segments are still the most recent segments in compaction task.
This could lead to a very large compaction task spec which could be larger than the max znode size of ZooKeeper. To avoid this problem, auto compaction supports a configuration of
maxNumSegmentsToCompactwhich limits the number of segments to compact together at the same time. However, with this way, the auto compaction has a limitation that it cannot compact an interval if there are too many segments.This PR is to avoid this issue by using a hash of segment IDs instead of the list of segments for validating input segments. The below changes are also included.
New IOConfig for compaction task
Compaction task now requires an ioConfig. You can set
inputSpecin the ioConfig. An example ioConfig is:There are two types of
inputSpecs, i.e.,intervalandsegments, for now.Using interval inputSpec for auto compaction
Auto compaction used to specify all segments to compact explicitly in the compaction task spec. Now it always uses the
intervalinputSpec instead.maxNumSegmentsToCompactwas dropped as well.This PR has: