Use hash of Segment IDs instead of a list of explicit segments in auto compaction #8571

Merged
gianm merged 7 commits into apache:master from jihoonson:compaction-io-config-master on Oct 9, 2019

@jihoonson jihoonson commented Sep 23, 2019

Description

Currently, when the coordinator issues a compaction task in auto compaction, it explicitly specifies the list of segments to compact. The compaction task uses this list to validate that the given segments are still the most recent ones.

This can produce a very large compaction task spec, potentially larger than ZooKeeper's maximum znode size. To avoid this problem, auto compaction supports a maxNumSegmentsToCompact configuration, which limits the number of segments compacted together. With this approach, however, auto compaction cannot compact an interval that contains too many segments.

This PR avoids this issue by validating input segments against a hash of their segment IDs instead of the full list of segments. The following changes are also included.

New IOConfig for compaction task

Compaction task now requires an ioConfig. You can set inputSpec in the ioConfig. An example ioConfig is:

  "ioConfig" : {
    "type": "compact",
    "inputSpec": {
      "type": "interval",
      "interval": "2017-01-01/2018-01-01"
    }
  }

For now, there are two types of inputSpec: interval and segments.

    "inputSpec": {
      "type": "interval",
      "interval": "2017-01-01/2018-01-01"
    }
    "inputSpec": {
      "type": "segments",
      "segments": ["segmentId1", "segmentId2", ...]
    }

Using interval inputSpec for auto compaction

Auto compaction used to specify all segments to compact explicitly in the compaction task spec. Now it always uses the interval inputSpec instead. maxNumSegmentsToCompact was dropped as well.
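
To illustrate the hash-based validation described above, here is a minimal sketch that hashes a sorted list of segment IDs with SHA-256 and compares the result against an expected hash. It relies only on the JDK's `MessageDigest`. The class name `SegmentIdHashSketch`, both method names, and the newline delimiter between IDs are assumptions made for illustration; this is not the code added by the PR.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not Druid's actual implementation.
final class SegmentIdHashSketch
{
  // Hashes the segment IDs in sorted order so the result does not depend on input order.
  static String hashOfSortedSegmentIds(List<String> segmentIds)
  {
    List<String> sorted = new ArrayList<>(segmentIds);
    Collections.sort(sorted);
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      for (String id : sorted) {
        digest.update(id.getBytes(StandardCharsets.UTF_8));
        // Assumed delimiter, so that ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update((byte) '\n');
      }
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest()) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    }
    catch (NoSuchAlgorithmException e) {
      // SHA-256 is a required algorithm on every Java platform.
      throw new IllegalStateException(e);
    }
  }

  // Returns true if the segments currently found in the interval match the expected hash.
  static boolean matches(String expectedHash, List<String> currentSegmentIds)
  {
    return expectedHash.equals(hashOfSortedSegmentIds(currentSegmentIds));
  }
}
```

With a scheme like this, the task spec carries one fixed-size string per interval regardless of how many segments the interval contains, which is what removes the pressure on the znode size limit.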


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths.
  • been tested in a test Druid cluster.

@PublicApi
public class SegmentUtils
{
private static final HashFunction HASH_FUNCTION = Hashing.sha256();
Contributor

If the hash does not need to be cryptographically secure, perhaps Hashing.murmur3_128() is a better option.

(If you change the hash function, the comment on 49 and CompactionIntervalSpec need to be updated too.)

Contributor Author

Hmm, why is it better?

Contributor

@ccaominh ccaominh Oct 3, 2019

Cryptographically secure hash functions are typically slower since they need to do more work to achieve that property. For example, this perf test observed sha256 to be about an order of magnitude slower than murmur3: https://rusty.ozlabs.org/?p=511

Comment thread docs/ingestion/data-management.md Outdated

### Compaction IOConfig

The compaction IOConfig requires to specify `inputSpec` as seen below.
Contributor

Suggestion: to specify -> specifying

Contributor Author

Fixed, thanks.

Comment on lines +200 to 210
if (ioConfig != null) {
  this.ioConfig = ioConfig;
} else {
  if (interval != null) {
    this.ioConfig = new CompactionIOConfig(new CompactionIntervalSpec(interval, null));
  } else if (segments != null && !segments.isEmpty()) {
    this.ioConfig = new CompactionIOConfig(SpecificSegmentsSpec.fromSegments(segments));
  } else {
    throw new IAE("Missing ioConfig");
  }
}
Contributor

This behavior is a bit different from before (e.g., it allows both interval and segments to be not null). Perhaps it's better to enforce that exactly one of interval, segments, or ioConfig is not null, which has similar behavior to the old code.

Contributor Author

Fixed.
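
The "exactly one of interval, segments, or ioConfig" rule suggested in this thread could be enforced with a small helper along the following lines. This is only a sketch: the class `ExactlyOneSketch`, the method `checkExactlyOneSet`, and the use of a name-to-value map are hypothetical and do not reflect how the PR actually implements the check.

```java
import java.util.Collection;
import java.util.Map;

// Hypothetical sketch, not the helper used in the PR.
final class ExactlyOneSketch
{
  // Treats null values and empty collections as "not set".
  private static boolean isSet(Object value)
  {
    if (value == null) {
      return false;
    }
    return !(value instanceof Collection) || !((Collection<?>) value).isEmpty();
  }

  // Returns the name of the single property that is set; throws if none or several are set.
  static String checkExactlyOneSet(Map<String, Object> properties)
  {
    String found = null;
    for (Map.Entry<String, Object> entry : properties.entrySet()) {
      if (isSet(entry.getValue())) {
        if (found != null) {
          throw new IllegalArgumentException(
              "Only one of " + properties.keySet() + " may be set, but got both "
              + found + " and " + entry.getKey()
          );
        }
        found = entry.getKey();
      }
    }
    if (found == null) {
      throw new IllegalArgumentException("Exactly one of " + properties.keySet() + " must be set");
    }
    return found;
  }
}
```

Given a map where only `interval` has a value, the helper returns `"interval"`; if `segments` is also non-empty, it throws instead of silently preferring one of them.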

import java.util.Objects;

@JsonTypeName("compact")
public class CompactionIOConfig implements IOConfig
Contributor

Do you want to add a javadoc for the class?

Contributor Author

Added.

import java.util.Objects;
import java.util.stream.Collectors;

public class SpecificSegmentsSpec implements CompactionInputSpec
Contributor

Do you want to add a javadoc for the class?

Contributor Author

I think this class is pretty obvious.

}

@JsonCreator
public SpecificSegmentsSpec(@JsonProperty("segments") List<String> segments)
Contributor

Is there a serde test for this?

Contributor Author

Yes, it's in CompactionTaskTest.

Assert.assertEquals(expected.getIoConfig(), actual.getIoConfig());
Assert.assertEquals(expected.getDimensionsSpec(), actual.getDimensionsSpec());
- Assert.assertTrue(Arrays.equals(expected.getMetricsSpec(), actual.getMetricsSpec()));
+ Assert.assertArrayEquals(expected.getMetricsSpec(), actual.getMetricsSpec());
Contributor

Nice!


import java.util.Objects;

public class ClientCompactionIOConfig
Contributor

Do you want to add a javadoc for the class?

Contributor Author

Added.

/**
* Specifying an interval to compact. A hash of the segment IDs can be optionally provided for segment validation.
*/
public class CompactionIntervalSpec implements CompactionInputSpec
Contributor

This is almost identical to ClientCompactionIntervalSpec. Is there a way to reuse the code?

Contributor Author

They are in different modules. Added javadoc.

private final CompactionInputSpec inputSpec;

@JsonCreator
public CompactionIOConfig(@JsonProperty("inputSpec") CompactionInputSpec inputSpec)
Contributor

Is there a serde test for this?

Contributor Author

Yes, it's in CompactionTaskTest.

@jihoonson
Contributor Author

@ccaominh thank you for the review! Addressed most of them and left some questions.

Comment thread docs/ingestion/data-management.md Outdated
### Compaction IOConfig

- The compaction IOConfig requires to specify `inputSpec` as seen below.
+ The compaction IOConfig requires to specifying `inputSpec` as seen below.
Contributor

typo: to specifying -> specifying

*/

- package org.apache.druid.indexer.partitions;
+ package org.apache.druid.indexer;
Contributor

Package for ChecksTest needs to be updated similarly

- class Checks
+ public final class Checks
{
public static <T> Property<T> checkOneNotNullOrEmpty(List<Property<T>> properties)
Contributor

Please add relevant tests to ChecksTest

Contributor

@ccaominh ccaominh left a comment

LGTM 👍

Member

@clintropolis clintropolis left a comment

🤘

/**
* InputSpec for {@link ClientCompactionIOConfig}.
*
* Should be synchronized with org.apache.druid.indexing.common.task.CompactionIntervalSpec.
Member

😢


@Nullable
@JsonProperty
public String getSha256OfSortedSegmentIds()
Member

nit: not personally super into this name, i feel like it makes me unnecessarily care about how the segment ids were hashed, but doesn't really matter i guess since this isn't so much directly used or directly constructed by users.

@gianm gianm added this to the 0.17.0 milestone Oct 9, 2019
@gianm gianm merged commit 96d8523 into apache:master Oct 9, 2019
@jon-wei jon-wei mentioned this pull request Dec 28, 2019