Option to configure default analysis types in SegmentMetadataQuery #4259
drcrallen merged 23 commits into apache:master
Conversation
```java
private Period defaultHistory = ISO_FORMATTER.parsePeriod(DEFAULT_PERIOD_STRING);

@JsonProperty
private EnumSet<SegmentMetadataQuery.AnalysisType> defaultAnalysisType = DEFAULT_ANALYSIS_TYPES;
```
Suggested defaultAnalysisTypes (update getter and setter names too)
gianm left a comment
I think with the way this is currently written, all nodes involved in a query (broker, historical, etc) will apply their own defaultAnalysisTypes, meaning we'll get weird behavior if different nodes have different values. To fix that, the broker should rewrite the query to include explicit analysis types, to prevent the historicals/etc from filling in their own (possibly different) set. If that's already happening somewhere then I missed it.
One place that could be done is in SegmentMetadataQueryQueryToolChest.mergeResults, it gets called pretty early in the query pipeline and can do rewrites like that. If you make this change then you should be able to revert the cache strategy back to the old code, since the caching layer won't activate until after the query is rewritten, and it can assume analysisTypes is set to an explicit value. As a sanity check you could have the cache strategy throw some kind of IllegalArgumentException if it's called on a query where analysisTypes is still null.
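To make the failure mode concrete, here is a sketch (hypothetical stand-in types, not the actual Druid classes) of two nodes with different defaultAnalysisTypes disagreeing on an implicit query, and agreeing once the broker rewrites it to be explicit:

```java
import java.util.EnumSet;

// Sketch of the inconsistency gianm describes: each node fills in its own
// default when analysisTypes is null, so nodes with different configs
// disagree. A broker-side rewrite makes the set explicit once, up front.
public class ConsistencySketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  // What a node computes for an incoming query (null = not specified).
  static EnumSet<AnalysisType> effectiveTypes(EnumSet<AnalysisType> queryTypes,
                                              EnumSet<AnalysisType> nodeDefault)
  {
    return queryTypes != null ? queryTypes : nodeDefault;
  }

  public static void main(String[] args)
  {
    EnumSet<AnalysisType> brokerDefault = EnumSet.of(AnalysisType.CARDINALITY, AnalysisType.INTERVAL);
    EnumSet<AnalysisType> historicalDefault = EnumSet.of(AnalysisType.MINMAX);

    // Without a rewrite, an implicit query means different things per node.
    if (effectiveTypes(null, brokerDefault).equals(effectiveTypes(null, historicalDefault))) {
      throw new AssertionError("demo requires differing node defaults");
    }

    // Broker rewrite: pin the types once, then every node sees the same set.
    EnumSet<AnalysisType> rewritten = effectiveTypes(null, brokerDefault);
    if (!effectiveTypes(rewritten, historicalDefault).equals(brokerDefault)) {
      throw new AssertionError();
    }

    System.out.println("consistency ok");
  }
}
```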
```java
public EnumSet<SegmentMetadataQuery.AnalysisType> getAnalysisTypes(SegmentMetadataQuery query)
{
  if (query.getAnalysisTypes() == null) {
    return config != null ? config.getDefaultAnalysisTypes() : SegmentMetadataQueryConfig.DEFAULT_ANALYSIS_TYPES;
```
It seems like handling null config shouldn't be necessary here. It shouldn't be null in production. If it is null sometimes in testing, how about replacing those spots with new SegmentMetadataQueryConfig() to get a default config? Then, SegmentMetadataQueryConfig.DEFAULT_ANALYSIS_TYPES could be private.
Given that the config is specific to the broker, it has been reimplemented to not be reliant on the query for analysisTypes.
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.segmentMetadata.defaultHistory`|When no interval is specified in the query, use a default interval of defaultHistory before the end time of the most recent segment, specified in ISO8601 format. This property also controls the duration of the default interval used by GET /druid/v2/datasources/{dataSourceName} interactions for retrieving datasource dimensions/metrics.|P1W|
|`druid.query.segmentMetadata.defaultAnalysisTypes`|Sets the default analysis types for all segment metadata queries; this can be overridden when making the query.|[CARDINALITY, INTERVAL, MINMAX]|
The default value here should be updated to ["cardinality", "interval", "minmax"]. Also please point to this config in docs of Segment Metadata Queries, line 35.
@fjy this PR actually blocks us from updating even to 0.10.0 (not just 0.10.1), so it fits the criteria that I suggested before, unlike many other PRs that currently target 0.10.1 and don't block anybody from updating to 0.10.0.
I think @gianm is proposing that the broker overwrite the analysis types if null is passed in, and leave them alone if someone sets them explicitly.
Yeah, I meant what @drcrallen said. In particular I'm suggesting that if the broker receives a query without analysisTypes set, it should rewrite the query to include analysisTypes before it passes it down to other nodes. Otherwise different nodes involved in the query might not agree on what the analysisTypes are, which could cause weird results.
I'm not sure if this was the best way of doing it, but I made the
```java
return analysisTypes;
}

public void setAnalysisTypes(EnumSet<AnalysisType> analysisTypes)
```
Query object should be immutable. Please replace this method with "withAnalysisTypes()", returning a new query object with the given analysis types. Also update SegmentMetadataQueryBuilder, to include a new field.
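The immutable, with-style replacement being asked for could look roughly like this (a sketch against simplified stand-in fields, not the real SegmentMetadataQuery constructor):

```java
import java.util.EnumSet;

// Sketch of replacing a mutable setter with a copy-on-write "with" method.
// The field names here are illustrative, not the real SegmentMetadataQuery ones.
public class WithMethodSketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  static final class Query
  {
    final boolean merge;
    final EnumSet<AnalysisType> analysisTypes;

    Query(boolean merge, EnumSet<AnalysisType> analysisTypes)
    {
      this.merge = merge;
      this.analysisTypes = analysisTypes;
    }

    // Instead of setAnalysisTypes(...): return a new query, leave this one untouched.
    Query withAnalysisTypes(EnumSet<AnalysisType> analysisTypes)
    {
      return new Query(merge, analysisTypes);
    }
  }

  public static void main(String[] args)
  {
    Query original = new Query(true, null);
    Query updated = original.withAnalysisTypes(EnumSet.of(AnalysisType.MINMAX));

    if (original.analysisTypes != null) throw new AssertionError(); // original untouched
    if (!updated.analysisTypes.equals(EnumSet.of(AnalysisType.MINMAX))) throw new AssertionError();
    if (!updated.merge) throw new AssertionError(); // other fields carried over

    System.out.println("immutable ok");
  }
}
```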
```java
{
  Query<SegmentAnalysis> query = queryPlus.getQuery();
  SegmentMetadataQuery castedQuery = (SegmentMetadataQuery) queryPlus.getQuery();
  SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) castedQuery.withAnalysisTypes(getFinalAnalysisTypes(castedQuery));
```
withAnalysisTypes() could return SegmentMetadataQuery and casting not needed.
Suggested to join those methods into a single method like SegmentMetadataQuery.withFinalizedAnalysisTypes(toolChest), for less boilerplate, because currently they are always used together. Also this method may avoid creating a new query object and return the same instance, if analysisTypes are already set.
```java
public EnumSet<SegmentMetadataQuery.AnalysisType> getFinalAnalysisTypes(SegmentMetadataQuery query)
{
  if (query.getAnalysisTypes() == null) {
```
Suggested Objects.firstNonNull()
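Guava's Objects.firstNonNull (now MoreObjects.firstNonNull) would collapse the null check into a single expression; the JDK's own Objects.requireNonNullElse (Java 9+) does the same thing, used here so the sketch stays dependency-free:

```java
import java.util.EnumSet;
import java.util.Objects;

// Collapsing "if (x == null) return fallback; return x;" into one
// expression, using the JDK analogue of Guava's Objects.firstNonNull.
public class FirstNonNullSketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  static final EnumSet<AnalysisType> DEFAULTS =
      EnumSet.of(AnalysisType.CARDINALITY, AnalysisType.INTERVAL, AnalysisType.MINMAX);

  static EnumSet<AnalysisType> finalTypes(EnumSet<AnalysisType> fromQuery)
  {
    return Objects.requireNonNullElse(fromQuery, DEFAULTS);
  }

  public static void main(String[] args)
  {
    if (!finalTypes(null).equals(DEFAULTS)) throw new AssertionError();
    EnumSet<AnalysisType> explicit = EnumSet.of(AnalysisType.MINMAX);
    if (!finalTypes(explicit).equals(explicit)) throw new AssertionError();
    System.out.println("firstNonNull ok");
  }
}
```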
```diff
 return new BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis>()
 {
-  private final SegmentMetadataQuery query = (SegmentMetadataQuery) inQ;
+  private final SegmentMetadataQuery query = updatedQuery;
```
```diff
@@ -121,7 +124,10 @@ public Sequence<SegmentAnalysis> doRun(
 @Override
 protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)
```
There seem to be no point for this anonymous QueryRunner to extend ResultMergeQueryRunner rather than BySegmentSkippingQueryRunner directly, because it overrides doRun(). Then this QueryRunner doesn't need to follow API of ResultMergeQueryRunner, it may accept SegmentMetadataQuery instead of Query<SegmentAnalysis> in makeOrdering() and createMergeFn(), to avoid casting.
```java
{
  if (((SegmentMetadataQuery) query).isMerge()) {
    SegmentMetadataQuery castedQuery = (SegmentMetadataQuery) query;
    SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) castedQuery.withAnalysisTypes(getFinalAnalysisTypes(castedQuery));
```
This is already done in upstream doRun() call.
```java
return this;
}
```
```diff
 @Override
 public CacheStrategy<SegmentAnalysis, SegmentAnalysis, SegmentMetadataQuery> getCacheStrategy(final SegmentMetadataQuery query)
 {
   {
     @Override
-    public boolean isCacheable(SegmentMetadataQuery query, boolean willMergeRunners)
+    public boolean isCacheable(SegmentMetadataQuery updatedQuery, boolean willMergeRunners)
```
```java
public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)
{
  if (!query.isUsingDefaultInterval()) {
    SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) query.withAnalysisTypes(getFinalAnalysisTypes(query));
```
This is pointless, because it doesn't affect isUsingDefaultInterval.
```java
private Boolean merge;
private Boolean lenientAggregatorMerge;
private Map<String, Object> context;
private Boolean usingDefaultInterval;
```
useDefaultInterval seems to be an unnecessary configuration that allows inconsistency: you could pass useDefaultInterval=false together with a querySegmentSpec that actually represents the default interval to the SegmentMetadataQuery constructor.
I suggest the following plan:
- Don't add usingDefaultInterval to this builder
- Leave usingDefaultInterval parameter of SegmentMetadataQuery for compatibility, but ignore it, and document the fact that it is going to be removed.
- In the constructor, set usingDefaultInterval=true if querySegmentSpec == null, or if querySegmentSpec is not null and has exactly one interval which is equal to the default interval.
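The inference proposed in the last bullet might look roughly like this (a hypothetical, simplified sketch; the real constructor takes a QuerySegmentSpec with Joda-Time Intervals, and intervals are plain strings here only to keep the example self-contained):

```java
import java.util.List;

// Sketch of inferring usingDefaultInterval inside the constructor instead
// of trusting a caller-supplied flag, per leventov's proposed plan.
public class InferDefaultIntervalSketch
{
  // Illustrative stand-in for SegmentMetadataQuery.DEFAULT_INTERVAL.
  static final String DEFAULT_INTERVAL = "0000-01-01/3000-01-01";

  static boolean inferUsingDefaultInterval(List<String> intervals)
  {
    // A null spec, or a spec containing exactly the one default interval,
    // both count as "using the default interval".
    return intervals == null
           || (intervals.size() == 1 && intervals.get(0).equals(DEFAULT_INTERVAL));
  }

  public static void main(String[] args)
  {
    if (!inferUsingDefaultInterval(null)) throw new AssertionError();
    if (!inferUsingDefaultInterval(List.of(DEFAULT_INTERVAL))) throw new AssertionError();
    if (inferUsingDefaultInterval(List.of("2017-01-01/2017-02-01"))) throw new AssertionError();
    System.out.println("infer ok");
  }
}
```

(Note that gianm later argues, further down in this conversation, that null and an explicitly supplied default interval should keep different behavior, so this inference was not adopted as-is.)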
```diff
 return new BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis>()
 {
-  private final SegmentMetadataQuery query = (SegmentMetadataQuery) inQ;
+  private final SegmentMetadataQuery query = inQ;
```
```diff
 @Override
-protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)
+protected Ordering<SegmentAnalysis> makeOrdering(SegmentMetadataQuery query)
```
```diff
 @Override
-protected BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis> createMergeFn(final Query<SegmentAnalysis> inQ)
+protected BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis> createMergeFn(final SegmentMetadataQuery inQ)
```
```java
import io.druid.query.metadata.SegmentMetadataQueryConfig;
import io.druid.query.spec.MultipleIntervalSegmentSpec;
import io.druid.query.spec.QuerySegmentSpec;
import jdk.nashorn.internal.ir.annotations.Ignore;
```
Don't use unrelated annotation
```diff
 @JsonProperty("context") Map<String, Object> context,
 @JsonProperty("analysisTypes") EnumSet<AnalysisType> analysisTypes,
-@JsonProperty("usingDefaultInterval") Boolean useDefaultInterval,
+@Ignore @JsonProperty("usingDefaultInterval") Boolean useDefaultInterval,
```
Please document in a simple comment that this parameter is ignored; also add a note that it is going to be removed and is left now only for compatibility.
```java
  this.usingDefaultInterval = true;
} else {
  this.usingDefaultInterval = useDefaultInterval == null ? false : useDefaultInterval;
  if (querySegmentSpec.getIntervals().size() == 1 && querySegmentSpec.getIntervals()
```
Also please add a test for this
Please address this. Also test that if the default interval is explicitly specified via the query builder (rather than left as null, which is replaced with the default interval in SegmentMetadataQuery's constructor), then isDefaultInterval is correctly set to true.
Ok, I somehow missed .get(0) on the next line. The formatting is strange; suggested:

```java
if (querySegmentSpec.getIntervals().size() == 1 &&
    querySegmentSpec.getIntervals().get(0).equals(DEFAULT_INTERVAL)) {
```
```java
public SegmentMetadataQuery withFinalizedAnalysisTypes(SegmentMetadataQueryConfig config)
{
  return Druids.SegmentMetadataQueryBuilder
```
Don't create a new query object if analysisTypes are already non-null, just return this
@gkc2104 BTW is this valid that
@leventov Agreed, it is building a new interval using Default History and the max segment time.
```java
for (int i = 0; i < filteredSegments2.size(); i++) {
  Assert.assertEquals(expectedSegments2.get(i).getInterval(), filteredSegments2.get(i).getInterval());
}
```
@leventov added the isUsingDefaultInterval in here along with a failing test (because of the filterSegments implementation). What should I do about filterSegments, though?
Updated
```java
final Interval targetInterval = new Interval(config.getDefaultHistory(), targetEnd);
List<Interval> intervals = query.getIntervals();

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();
```
In other parts of the Druid codebase new DateTime(JodaUtils.MAX_INSTANT) is used
```java
List<Interval> intervals = query.getIntervals();

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();
DateTime queryEndTIme = JodaUtils.ETERNITY.getStart();

for (Interval interval : intervals) {
```
Consider:

```java
queryStartTime = intervals.stream().map(Interval::getEnd)
    .max(Ordering.natural()).orElseThrow(IllegalStateException::new);
```
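A self-contained, runnable version of that idea (with a simple record standing in for Joda-Time's Interval and Guava's Ordering replaced by primitive stream min/max, so the sketch has no dependencies): computing the overall start/end bounds of an interval list with streams instead of a mutable loop.

```java
import java.util.List;

// Stream-based bounds computation over a list of intervals, failing fast
// on an empty list, as the review suggests. Epoch-millis longs stand in
// for Joda-Time instants purely to keep the example dependency-free.
public class BoundsSketch
{
  record Interval(long start, long end) {}

  public static void main(String[] args)
  {
    List<Interval> intervals =
        List.of(new Interval(10, 20), new Interval(5, 15), new Interval(12, 30));

    // Latest end across all intervals.
    long maxEnd = intervals.stream().mapToLong(Interval::end)
        .max().orElseThrow(IllegalStateException::new);
    // Earliest start, computed symmetrically.
    long minStart = intervals.stream().mapToLong(Interval::start)
        .min().orElseThrow(IllegalStateException::new);

    if (maxEnd != 30) throw new AssertionError();
    if (minStart != 5) throw new AssertionError();
    System.out.println("bounds ok");
  }
}
```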
@fjy restoring 0.10.1 milestone because it's an important change that restores compatibility of segment metadata queries. @gianm @drcrallen could you please review the design again?
gianm left a comment
Looks good except for the defaultInterval stuff. I'm not sure how that got roped into this patch, but I think it's fine the way it is and doesn't need to be changed.
@leventov wrote,
useDefaultInterval seems to be an unnecessary configuration that allows inconsistency: you could pass useDefaultInterval=false together with a querySegmentSpec that actually represents the default interval to the SegmentMetadataQuery constructor.
I suggest the following plan:
- Don't add usingDefaultInterval to this builder
- Leave usingDefaultInterval parameter of SegmentMetadataQuery for compatibility, but ignore it, and document the fact that it is going to be removed.
- In the constructor, set usingDefaultInterval=true if querySegmentSpec == null, or if querySegmentSpec is not null and has exactly one interval which is equal to the default interval.
See #1732 (comment) for the original rationale about why this exists. I guess it was confusing, so we should add a comment, but it does serve a purpose.
It definitely shouldn't be something a user should be able to provide in the builder. Its only purpose is to help "remember" whether or not a user provided intervals. The idea is that we wanted providing null and providing the default interval explicitly to have different behavior: null uses druid.query.segmentMetadata.defaultHistory prior to maxTime, while explicitly specifying the interval from DEFAULT_INTERVAL will do exactly what the user asked for, and use that specific interval, not the one based on druid.query.segmentMetadata.defaultHistory.
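That distinction can be sketched like this (a hypothetical helper with epoch-millis longs in place of Joda Intervals; the real resolution happens where defaultHistory and the datasource's maxTime are known):

```java
// Sketch of why null and an explicitly supplied default interval must stay
// distinguishable: null resolves to [maxTime - defaultHistory, maxTime),
// while any explicit interval, even one equal to DEFAULT_INTERVAL, is used
// verbatim. Times are epoch millis just to keep the example dependency-free.
public class IntervalResolutionSketch
{
  // Illustrative stand-in for Druid's DEFAULT_INTERVAL bounds.
  static final long DEFAULT_INTERVAL_START = 0L;
  static final long DEFAULT_INTERVAL_END = 4_000_000L;

  // Resolve the effective [start, end) given what the user provided.
  static long[] resolve(long[] userInterval, long defaultHistoryMillis, long maxTime)
  {
    if (userInterval == null) {
      // "Using the default interval": window of defaultHistory before maxTime.
      return new long[]{maxTime - defaultHistoryMillis, maxTime};
    }
    return userInterval; // explicit interval: used exactly as given
  }

  public static void main(String[] args)
  {
    long maxTime = 1_000_000L;
    long defaultHistory = 100L;

    long[] fromNull = resolve(null, defaultHistory, maxTime);
    if (fromNull[0] != 999_900L || fromNull[1] != 1_000_000L) throw new AssertionError();

    // Explicitly passing the default interval gives that exact interval back.
    long[] explicit =
        resolve(new long[]{DEFAULT_INTERVAL_START, DEFAULT_INTERVAL_END}, defaultHistory, maxTime);
    if (explicit[0] != 0L || explicit[1] != 4_000_000L) throw new AssertionError();

    System.out.println("resolution ok");
  }
}
```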
```diff
@@ -235,18 +236,27 @@ public SegmentAnalysis apply(@Nullable SegmentAnalysis input)
 @Override
 public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)
```
What's the purpose of the changes in this method? They don't seem related to the rest of the patch.
Yes, these are unrelated to the changes proposed for this patch, I'll open up another PR for this.
The current implementation does not take into consideration the time range while filtering, I added an improvement to only include segments that are within the intervals mentioned by the query.
The current implementation does not take into consideration the time range while filtering, I added an improvement to only include segments that are within the intervals mentioned by the query.
filterSegments doesn't need to worry about that, since the list of segments it gets has already been pruned down to whatever matches the intervals filter. In fact many implementations just return segments; with no changes -- check out timeseries for an example. filterSegments only has to potentially filter it down even further, if it wants (see timeBoundary for another example of that). I suppose the javadocs for filterSegments could be more clear on this.
Please also see #4259 (review) for rationale about the current behavior of defaultInterval.
I was under the impression that this method was called to prune the segments to include only the required ones from a list of all segments; I didn't know the list was already pruned. Now that I look at the examples you suggested, I understand why it is required even if we already prune it.
Regarding defaultInterval: I had some failing tests that were because of defaultInterval. I don't remember the exact test case, but reverting to the old behavior did not break any tests (probably the change of not having null configs helped, I guess), so we can keep the implementation on master.
I excluded these changes from this PR.
```java
{
  SegmentMetadataQuery query = (SegmentMetadataQuery) inQ.getQuery();
  final SegmentAnalyzer analyzer = new SegmentAnalyzer(query.getAnalysisTypes());
  SegmentMetadataQuery updatedQuery = query.withFinalizedAnalysisTypes(toolChest.getConfig());
```
Might as well do this and the cast in one line to avoid accidentally referencing query instead of updatedQuery.
```java
@JsonProperty("merge") Boolean merge,
@JsonProperty("context") Map<String, Object> context,
@JsonProperty("analysisTypes") EnumSet<AnalysisType> analysisTypes,
// useDefaultInterval will be removed, but is left for now for compatibility
```
What's wrong with useDefaultInterval?
Needs one more design review.
#3773
Now that defaultAnalysisTypes is being updated, it would make it easier for users to update to 0.10.0 if they could easily revert to the old behavior without having to update all SegmentMetadataQueries they use. Suggest not only providing an option to revert to the old behavior, but also letting users specify their own defaultAnalysisTypes using druid.query.segmentMetadata.defaultAnalysisType.