-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Towards consistent null handling #995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| Druid - a distributed column store. | ||
| Copyright 2012-2015 Metamarkets Group Inc. |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -81,7 +81,6 @@ | |
| <artifactId>mapdb</artifactId> | ||
| </dependency> | ||
|
|
||
|
|
||
| <!-- Tests --> | ||
| <dependency> | ||
| <groupId>junit</groupId> | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| /* | ||
| * Licensed to Metamarkets Group Inc. (Metamarkets) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. Metamarkets licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
|
|
||
| package io.druid.segment; | ||
|
|
||
| import com.google.common.base.Strings; | ||
| import com.google.common.collect.Iterators; | ||
| import io.druid.segment.data.IndexedInts; | ||
|
|
||
| import java.util.Iterator; | ||
|
|
||
| public class NullDimensionSelector implements DimensionSelector | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Class is lacking tests.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Wait, I think I found them as part of Timeseries query runner test.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There are no direct tests, that's true. It is tested through the various queries, but some direct tests would probably also be meaningful. |
||
| { | ||
|
|
||
| private static final IndexedInts SINGLETON = new IndexedInts() { | ||
| @Override | ||
| public int size() { | ||
| return 1; | ||
| } | ||
|
|
||
| @Override | ||
| public int get(int index) { | ||
| return 0; | ||
| } | ||
|
|
||
| @Override | ||
| public Iterator<Integer> iterator() { | ||
| return Iterators.singletonIterator(0); | ||
| } | ||
| }; | ||
|
|
||
| @Override | ||
| public IndexedInts getRow() | ||
| { | ||
| return SINGLETON; | ||
| } | ||
|
|
||
| @Override | ||
| public int getValueCardinality() | ||
| { | ||
| return 1; | ||
| } | ||
|
|
||
| @Override | ||
| public String lookupName(int id) | ||
| { | ||
| return null; | ||
| } | ||
|
|
||
| @Override | ||
| public int lookupId(String name) | ||
| { | ||
| return Strings.isNullOrEmpty(name) ? 0 : -1; | ||
| } | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this only on timeseries query and not other query types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the behavior of groupBy with no dimensions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only creates an entry for each time bucket that actually exists, iirc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gianm and I fixed some inconsistencies related to empty time buckets when buckets at query interval boundaries and maxtime don't line up, see #705.
It is still not consistent with groupBy, which confuses users (#701).
Do we need this flag for topN as well?
I believe things would be more consistent overall if we skipped empty buckets by default, since if data is missing for an entire segment granularity, those buckets will be missing anyway, and I don't believe results should depend on segment granularity.
I would be in favor to skip empty buckets by default in 0.7, but we may want to make that change as part of a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xvrl re: topN
And yes, skipping empty buckets by default is what timeseries initially did, then I had it auto-generate empty values as an indirect mechanism of figuring out if a segment exists or not (i.e. it will generate 0's if the segment exists and there just is nothing there, where if the segment doesn't exist then it won't generate anything). This proved to be not enough to determine that a segment isn't actually there, though. So, while I agree with you in principle that switching back to the original "never generate empty values" behavior is more correct, the fact is that there might be people who are expecting those values to be generated for them and making this change in a backwards-incompatible manner could make it very difficult for them to actually move forward.
If we want to make the change to timeseries defaulting to not generating anything, that should be done in a subsequent version. That allows people using the system some time to set this parameter first and rework their systems before changing the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we don't have to make those changes in 0.7, only if we felt strongly about making things more consistent and did not want to wait for 0.8 to make that change.