Option to configure default analysis types in SegmentMetadataQuery #4259
drcrallen merged 23 commits into apache:master
Conversation
```java
private Period defaultHistory = ISO_FORMATTER.parsePeriod(DEFAULT_PERIOD_STRING);

@JsonProperty
private EnumSet<SegmentMetadataQuery.AnalysisType> defaultAnalysisType = DEFAULT_ANALYSIS_TYPES;
```
Suggested defaultAnalysisTypes (update getter and setter names too)
gianm left a comment
I think with the way this is currently written, all nodes involved in a query (broker, historical, etc) will apply their own defaultAnalysisTypes, meaning we'll get weird behavior if different nodes have different values. To fix that, the broker should rewrite the query to include explicit analysis types, to prevent the historicals/etc from filling in their own (possibly different) set. If that's already happening somewhere then I missed it.
One place that could be done is in SegmentMetadataQueryQueryToolChest.mergeResults, it gets called pretty early in the query pipeline and can do rewrites like that. If you make this change then you should be able to revert the cache strategy back to the old code, since the caching layer won't activate until after the query is rewritten, and it can assume analysisTypes is set to an explicit value. As a sanity check you could have the cache strategy throw some kind of IllegalArgumentException if it's called on a query where analysisTypes is still null.
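To make the failure mode concrete, here is a sketch (hypothetical stand-in types, not the actual Druid classes) of two nodes with different defaultAnalysisTypes disagreeing on an implicit query, and agreeing once the broker rewrites it to be explicit:

```java
import java.util.EnumSet;

// Sketch of the inconsistency gianm describes: each node fills in its own
// default when analysisTypes is null, so nodes with different configs
// disagree. A broker-side rewrite makes the set explicit once, up front.
public class ConsistencySketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  // What a node computes for an incoming query (null = not specified).
  static EnumSet<AnalysisType> effectiveTypes(EnumSet<AnalysisType> queryTypes,
                                              EnumSet<AnalysisType> nodeDefault)
  {
    return queryTypes != null ? queryTypes : nodeDefault;
  }

  public static void main(String[] args)
  {
    EnumSet<AnalysisType> brokerDefault = EnumSet.of(AnalysisType.CARDINALITY, AnalysisType.INTERVAL);
    EnumSet<AnalysisType> historicalDefault = EnumSet.of(AnalysisType.MINMAX);

    // Without a rewrite, an implicit query means different things per node.
    if (effectiveTypes(null, brokerDefault).equals(effectiveTypes(null, historicalDefault))) {
      throw new AssertionError("demo requires differing node defaults");
    }

    // Broker rewrite: pin the types once, then every node sees the same set.
    EnumSet<AnalysisType> rewritten = effectiveTypes(null, brokerDefault);
    if (!effectiveTypes(rewritten, historicalDefault).equals(brokerDefault)) {
      throw new AssertionError();
    }

    System.out.println("consistency ok");
  }
}
```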
```java
public EnumSet<SegmentMetadataQuery.AnalysisType> getAnalysisTypes(SegmentMetadataQuery query)
{
  if (query.getAnalysisTypes() == null) {
    return config != null ? config.getDefaultAnalysisTypes() : SegmentMetadataQueryConfig.DEFAULT_ANALYSIS_TYPES;
```
It seems like handling null config shouldn't be necessary here. It shouldn't be null in production. If it is null sometimes in testing, how about replacing those spots with new SegmentMetadataQueryConfig() to get a default config? Then, SegmentMetadataQueryConfig.DEFAULT_ANALYSIS_TYPES could be private.
Given that the config is specific to the broker, it has been reimplemented to not be reliant on the query for analysisTypes.
|Property|Description|Default|
|--------|-----------|-------|
|`druid.query.segmentMetadata.defaultHistory`|When no interval is specified in the query, use a default interval of defaultHistory before the end time of the most recent segment, specified in ISO8601 format. This property also controls the duration of the default interval used by GET /druid/v2/datasources/{dataSourceName} interactions for retrieving datasource dimensions/metrics.|P1W|
|`druid.query.segmentMetadata.defaultAnalysisTypes`|Sets the default analysis types for all segment metadata queries; this can be overridden when making the query.|[CARDINALITY, INTERVAL, MINMAX]|
The default value here should be updated to ["cardinality", "interval", "minmax"]. Also please point to this config in docs of Segment Metadata Queries, line 35.
@fjy this PR actually blocks us from updating even to 0.10.0 (not just 0.10.1), so it fits the criteria that I suggested before, unlike many other PRs that currently target 0.10.1 and don't block anybody from updating to 0.10.0.
I think @gianm is proposing that the broker overwrite the analysis types if null is passed in, and leave them alone if someone sets them explicitly.
Yeah, I meant what @drcrallen said. In particular I'm suggesting that if the broker receives a query without analysisTypes set, it should rewrite the query to include analysisTypes before it passes it down to other nodes. Otherwise different nodes involved in the query might not agree on what the analysisTypes are, which could cause weird results.
I'm not sure if this was the best way of doing it, but I made the
```java
return analysisTypes;
}

public void setAnalysisTypes(EnumSet<AnalysisType> analysisTypes)
```
Query object should be immutable. Please replace this method with "withAnalysisTypes()", returning a new query object with the given analysis types. Also update SegmentMetadataQueryBuilder, to include a new field.
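The immutable, with-style replacement being asked for could look roughly like this (a sketch against simplified stand-in fields, not the real SegmentMetadataQuery constructor):

```java
import java.util.EnumSet;

// Sketch of replacing a mutable setter with a copy-on-write "with" method.
// The field names here are illustrative, not the real SegmentMetadataQuery ones.
public class WithMethodSketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  static final class Query
  {
    final boolean merge;
    final EnumSet<AnalysisType> analysisTypes;

    Query(boolean merge, EnumSet<AnalysisType> analysisTypes)
    {
      this.merge = merge;
      this.analysisTypes = analysisTypes;
    }

    // Instead of setAnalysisTypes(...): return a new query, leave this one untouched.
    Query withAnalysisTypes(EnumSet<AnalysisType> analysisTypes)
    {
      return new Query(merge, analysisTypes);
    }
  }

  public static void main(String[] args)
  {
    Query original = new Query(true, null);
    Query updated = original.withAnalysisTypes(EnumSet.of(AnalysisType.MINMAX));

    if (original.analysisTypes != null) throw new AssertionError(); // original untouched
    if (!updated.analysisTypes.equals(EnumSet.of(AnalysisType.MINMAX))) throw new AssertionError();
    if (!updated.merge) throw new AssertionError(); // other fields carried over

    System.out.println("immutable ok");
  }
}
```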
```java
{
  Query<SegmentAnalysis> query = queryPlus.getQuery();
  SegmentMetadataQuery castedQuery = (SegmentMetadataQuery) queryPlus.getQuery();
  SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) castedQuery.withAnalysisTypes(getFinalAnalysisTypes(castedQuery));
```
withAnalysisTypes() could return SegmentMetadataQuery and casting not needed.
Suggested to join those methods into a single method like SegmentMetadataQuery.withFinalizedAnalysisTypes(toolChest), for less boilerplate, because currently they are always used together. Also this method may avoid creating a new query object and return the same instance, if analysisTypes are already set.
```java
public EnumSet<SegmentMetadataQuery.AnalysisType> getFinalAnalysisTypes(SegmentMetadataQuery query)
{
  if (query.getAnalysisTypes() == null) {
```
Suggested Objects.firstNonNull()
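Guava's Objects.firstNonNull (now MoreObjects.firstNonNull) would collapse the null check into a single expression; the JDK's own Objects.requireNonNullElse (Java 9+) does the same thing, used here so the sketch stays dependency-free:

```java
import java.util.EnumSet;
import java.util.Objects;

// Collapsing "if (x == null) return fallback; return x;" into one
// expression, using the JDK analogue of Guava's Objects.firstNonNull.
public class FirstNonNullSketch
{
  enum AnalysisType { CARDINALITY, INTERVAL, MINMAX }

  static final EnumSet<AnalysisType> DEFAULTS =
      EnumSet.of(AnalysisType.CARDINALITY, AnalysisType.INTERVAL, AnalysisType.MINMAX);

  static EnumSet<AnalysisType> finalTypes(EnumSet<AnalysisType> fromQuery)
  {
    return Objects.requireNonNullElse(fromQuery, DEFAULTS);
  }

  public static void main(String[] args)
  {
    if (!finalTypes(null).equals(DEFAULTS)) throw new AssertionError();
    EnumSet<AnalysisType> explicit = EnumSet.of(AnalysisType.MINMAX);
    if (!finalTypes(explicit).equals(explicit)) throw new AssertionError();
    System.out.println("firstNonNull ok");
  }
}
```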
```diff
 return new BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis>()
 {
-  private final SegmentMetadataQuery query = (SegmentMetadataQuery) inQ;
+  private final SegmentMetadataQuery query = updatedQuery;
```
```diff
@@ -121,7 +124,10 @@ public Sequence<SegmentAnalysis> doRun(
 @Override
 protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)
```
There seem to be no point for this anonymous QueryRunner to extend ResultMergeQueryRunner rather than BySegmentSkippingQueryRunner directly, because it overrides doRun(). Then this QueryRunner doesn't need to follow API of ResultMergeQueryRunner, it may accept SegmentMetadataQuery instead of Query<SegmentAnalysis> in makeOrdering() and createMergeFn(), to avoid casting.
```java
{
  if (((SegmentMetadataQuery) query).isMerge()) {
    SegmentMetadataQuery castedQuery = (SegmentMetadataQuery) query;
    SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) castedQuery.withAnalysisTypes(getFinalAnalysisTypes(castedQuery));
```
This is already done in upstream doRun() call.
```java
return this;
}
```
```diff
 @Override
 public CacheStrategy<SegmentAnalysis, SegmentAnalysis, SegmentMetadataQuery> getCacheStrategy(final SegmentMetadataQuery query)
 {
   {
     @Override
-    public boolean isCacheable(SegmentMetadataQuery query, boolean willMergeRunners)
+    public boolean isCacheable(SegmentMetadataQuery updatedQuery, boolean willMergeRunners)
```
```java
public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)
{
  if (!query.isUsingDefaultInterval()) {
    SegmentMetadataQuery updatedQuery = (SegmentMetadataQuery) query.withAnalysisTypes(getFinalAnalysisTypes(query));
```
This is pointless, because it doesn't affect isUsingDefaultInterval.
```java
private Boolean merge;
private Boolean lenientAggregatorMerge;
private Map<String, Object> context;
private Boolean usingDefaultInterval;
```
useDefaultInterval seems to be an unnecessary configuration that allows inconsistency: you could pass useDefaultInterval=false together with a querySegmentSpec that actually represents the default interval to the SegmentMetadataQuery constructor.
I suggest the following plan:
- Don't add usingDefaultInterval to this builder
- Leave usingDefaultInterval parameter of SegmentMetadataQuery for compatibility, but ignore it, and document the fact that it is going to be removed.
- In the constructor, set usingDefaultInterval=true if querySegmentSpec == null, or if querySegmentSpec is not null and has exactly one interval which is equal to the default interval.
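The inference proposed in the last bullet might look roughly like this (a hypothetical, simplified sketch; the real constructor takes a QuerySegmentSpec with Joda-Time Intervals, and intervals are plain strings here only to keep the example self-contained):

```java
import java.util.List;

// Sketch of inferring usingDefaultInterval inside the constructor instead
// of trusting a caller-supplied flag, per leventov's proposed plan.
public class InferDefaultIntervalSketch
{
  // Illustrative stand-in for SegmentMetadataQuery.DEFAULT_INTERVAL.
  static final String DEFAULT_INTERVAL = "0000-01-01/3000-01-01";

  static boolean inferUsingDefaultInterval(List<String> intervals)
  {
    // A null spec, or a spec containing exactly the one default interval,
    // both count as "using the default interval".
    return intervals == null
           || (intervals.size() == 1 && intervals.get(0).equals(DEFAULT_INTERVAL));
  }

  public static void main(String[] args)
  {
    if (!inferUsingDefaultInterval(null)) throw new AssertionError();
    if (!inferUsingDefaultInterval(List.of(DEFAULT_INTERVAL))) throw new AssertionError();
    if (inferUsingDefaultInterval(List.of("2017-01-01/2017-02-01"))) throw new AssertionError();
    System.out.println("infer ok");
  }
}
```

(Note that gianm later argues, further down in this conversation, that null and an explicitly supplied default interval should keep different behavior, so this inference was not adopted as-is.)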
```diff
 return new BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis>()
 {
-  private final SegmentMetadataQuery query = (SegmentMetadataQuery) inQ;
+  private final SegmentMetadataQuery query = inQ;
```
```diff
 @Override
-protected Ordering<SegmentAnalysis> makeOrdering(Query<SegmentAnalysis> query)
+protected Ordering<SegmentAnalysis> makeOrdering(SegmentMetadataQuery query)
```
```diff
 @Override
-protected BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis> createMergeFn(final Query<SegmentAnalysis> inQ)
+protected BinaryFn<SegmentAnalysis, SegmentAnalysis, SegmentAnalysis> createMergeFn(final SegmentMetadataQuery inQ)
```
```java
import io.druid.query.metadata.SegmentMetadataQueryConfig;
import io.druid.query.spec.MultipleIntervalSegmentSpec;
import io.druid.query.spec.QuerySegmentSpec;
import jdk.nashorn.internal.ir.annotations.Ignore;
```
Don't use unrelated annotation
```diff
 @JsonProperty("context") Map<String, Object> context,
 @JsonProperty("analysisTypes") EnumSet<AnalysisType> analysisTypes,
-@JsonProperty("usingDefaultInterval") Boolean useDefaultInterval,
+@Ignore @JsonProperty("usingDefaultInterval") Boolean useDefaultInterval,
```
Please document in a simple comment that this parameter is ignored; also add a note that it is going to be removed and is left now only for compatibility.
```java
  this.usingDefaultInterval = true;
} else {
  this.usingDefaultInterval = useDefaultInterval == null ? false : useDefaultInterval;
  if (querySegmentSpec.getIntervals().size() == 1 && querySegmentSpec.getIntervals()
```
Also please add a test for this
Please address this. Also test that if the default interval is explicitly specified via the query builder (rather than left as null, which is replaced with the default interval in SegmentMetadataQuery's constructor), then isDefaultInterval is correctly set to true.
Ok, I somehow missed .get(0) on the next line. The formatting is strange; suggested:

```java
if (querySegmentSpec.getIntervals().size() == 1 &&
    querySegmentSpec.getIntervals().get(0).equals(DEFAULT_INTERVAL)) {
```
```java
public SegmentMetadataQuery withFinalizedAnalysisTypes(SegmentMetadataQueryConfig config)
{
  return Druids.SegmentMetadataQueryBuilder
```
Don't create a new query object if analysisTypes are already non-null, just return this
@gkc2104 BTW is this valid that
@leventov Agreed, it is building a new interval using Default History and the max segment time.
```java
for (int i = 0; i < filteredSegments2.size(); i++) {
  Assert.assertEquals(expectedSegments2.get(i).getInterval(), filteredSegments2.get(i).getInterval());
}
```
@leventov added the isUsingDefaultInterval in here along with a failing test (because of the filterSegments implementation). What should I do about filterSegments, though?
Updated
```java
final Interval targetInterval = new Interval(config.getDefaultHistory(), targetEnd);
List<Interval> intervals = query.getIntervals();

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();
```
In other parts of the Druid codebase new DateTime(JodaUtils.MAX_INSTANT) is used
```java
List<Interval> intervals = query.getIntervals();

DateTime queryStartTime = JodaUtils.ETERNITY.getEnd();
DateTime queryEndTIme = JodaUtils.ETERNITY.getStart();

for (Interval interval : intervals) {
```
Consider:

```java
queryStartTime = intervals.stream().map(Interval::getEnd)
    .max(Ordering.natural()).orElseThrow(IllegalStateException::new);
```
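A self-contained, runnable version of that idea (with a simple record standing in for Joda-Time's Interval and Guava's Ordering replaced by primitive stream min/max, so the sketch has no dependencies): computing the overall start/end bounds of an interval list with streams instead of a mutable loop.

```java
import java.util.List;

// Stream-based bounds computation over a list of intervals, failing fast
// on an empty list, as the review suggests. Epoch-millis longs stand in
// for Joda-Time instants purely to keep the example dependency-free.
public class BoundsSketch
{
  record Interval(long start, long end) {}

  public static void main(String[] args)
  {
    List<Interval> intervals =
        List.of(new Interval(10, 20), new Interval(5, 15), new Interval(12, 30));

    // Latest end across all intervals.
    long maxEnd = intervals.stream().mapToLong(Interval::end)
        .max().orElseThrow(IllegalStateException::new);
    // Earliest start, computed symmetrically.
    long minStart = intervals.stream().mapToLong(Interval::start)
        .min().orElseThrow(IllegalStateException::new);

    if (maxEnd != 30) throw new AssertionError();
    if (minStart != 5) throw new AssertionError();
    System.out.println("bounds ok");
  }
}
```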
@fjy restoring 0.10.1 milestone because it's an important change that restores compatibility of segment metadata queries. @gianm @drcrallen could you please review the design again?
gianm left a comment
Looks good except for the defaultInterval stuff. I'm not sure how that got roped into this patch, but I think it's fine the way it is and doesn't need to be changed.
@leventov wrote,
useDefaultInterval seems to be an unnecessary configuration that allows inconsistency: you could pass useDefaultInterval=false together with a querySegmentSpec that actually represents the default interval to the SegmentMetadataQuery constructor.
I suggest the following plan:
- Don't add usingDefaultInterval to this builder
- Leave usingDefaultInterval parameter of SegmentMetadataQuery for compatibility, but ignore it, and document the fact that it is going to be removed.
- In the constructor, set usingDefaultInterval=true if querySegmentSpec == null, or if querySegmentSpec is not null and has exactly one interval which is equal to the default interval.
See #1732 (comment) for the original rationale about why this exists. I guess it was confusing, so we should add a comment, but it does serve a purpose.
It definitely shouldn't be something a user should be able to provide in the builder. Its only purpose is to help "remember" whether or not a user provided intervals. The idea is that we wanted providing null and providing the default interval explicitly to have different behavior: null uses druid.query.segmentMetadata.defaultHistory prior to maxTime, while explicitly specifying the interval from DEFAULT_INTERVAL will do exactly what the user asked for, and use that specific interval, not the one based on druid.query.segmentMetadata.defaultHistory.
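That distinction can be sketched like this (a hypothetical helper with epoch-millis longs in place of Joda Intervals; the real resolution happens where defaultHistory and the datasource's maxTime are known):

```java
// Sketch of why null and an explicitly supplied default interval must stay
// distinguishable: null resolves to [maxTime - defaultHistory, maxTime),
// while any explicit interval, even one equal to DEFAULT_INTERVAL, is used
// verbatim. Times are epoch millis just to keep the example dependency-free.
public class IntervalResolutionSketch
{
  // Illustrative stand-in for Druid's DEFAULT_INTERVAL bounds.
  static final long DEFAULT_INTERVAL_START = 0L;
  static final long DEFAULT_INTERVAL_END = 4_000_000L;

  // Resolve the effective [start, end) given what the user provided.
  static long[] resolve(long[] userInterval, long defaultHistoryMillis, long maxTime)
  {
    if (userInterval == null) {
      // "Using the default interval": window of defaultHistory before maxTime.
      return new long[]{maxTime - defaultHistoryMillis, maxTime};
    }
    return userInterval; // explicit interval: used exactly as given
  }

  public static void main(String[] args)
  {
    long maxTime = 1_000_000L;
    long defaultHistory = 100L;

    long[] fromNull = resolve(null, defaultHistory, maxTime);
    if (fromNull[0] != 999_900L || fromNull[1] != 1_000_000L) throw new AssertionError();

    // Explicitly passing the default interval gives that exact interval back.
    long[] explicit =
        resolve(new long[]{DEFAULT_INTERVAL_START, DEFAULT_INTERVAL_END}, defaultHistory, maxTime);
    if (explicit[0] != 0L || explicit[1] != 4_000_000L) throw new AssertionError();

    System.out.println("resolution ok");
  }
}
```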
```diff
@@ -235,18 +236,27 @@ public SegmentAnalysis apply(@Nullable SegmentAnalysis input)
 @Override
 public <T extends LogicalSegment> List<T> filterSegments(SegmentMetadataQuery query, List<T> segments)
```
What's the purpose of the changes in this method? They don't seem related to the rest of the patch.
Yes, these are unrelated to the changes proposed for this patch, I'll open up another PR for this.
The current implementation does not take into consideration the time range while filtering, I added an improvement to only include segments that are within the intervals mentioned by the query.
The current implementation does not take into consideration the time range while filtering, I added an improvement to only include segments that are within the intervals mentioned by the query.
filterSegments doesn't need to worry about that, since the list of segments it gets has already been pruned down to whatever matches the intervals filter. In fact many implementations just return segments; with no changes -- check out timeseries for an example. filterSegments only has to potentially filter it down even further, if it wants (see timeBoundary for another example of that). I suppose the javadocs for filterSegments could be more clear on this.
Please also see #4259 (review) for rationale about the current behavior of defaultInterval.
I was under the impression that this method was called to prune the segments to include only the required ones from a list of all segments; I didn't know the list was already pruned. Now that I look at the examples you suggested, I understand why it is required even if we already prune it.
Regarding defaultInterval: I had some failing tests that were because of defaultInterval. I don't remember the exact test case, but reverting to the old behavior did not break any tests (probably the change of not having null configs helped, I guess), so we can keep the implementation on master.
I excluded these changes from this PR.
```java
{
  SegmentMetadataQuery query = (SegmentMetadataQuery) inQ.getQuery();
  final SegmentAnalyzer analyzer = new SegmentAnalyzer(query.getAnalysisTypes());
  SegmentMetadataQuery updatedQuery = query.withFinalizedAnalysisTypes(toolChest.getConfig());
```
Might as well do this and the cast in one line to avoid accidentally referencing query instead of updatedQuery.
```java
@JsonProperty("merge") Boolean merge,
@JsonProperty("context") Map<String, Object> context,
@JsonProperty("analysisTypes") EnumSet<AnalysisType> analysisTypes,
// useDefaultInterval will be removed, but is left for now for compatibility
```
What's wrong with useDefaultInterval?
Needs one more design review.
#3773
Now that defaultAnalysisTypes is being updated, it would make it easier for users to update to 0.10.0 if they could easily revert to the old behavior without having to update all SegmentMetadataQueries they use. Suggest not only providing an option to revert to the old behavior, but also letting users specify their own defaultAnalysisTypes using druid.query.segmentMetadata.defaultAnalysisType.