
Towards consistent null handling #995

Merged
xvrl merged 1 commit into master from null-fixes
Feb 2, 2015

Conversation

@cheddar

@cheddar cheddar commented Dec 24, 2014

This PR includes a number of fixes to handle nulls more consistently on the query side. These fixes were done to support a user who is leveraging the schema-less capabilities of Druid. It's sufficient for all needs we've currently found, but I am not certain that it is comprehensive just yet.

Fixes #665 in the following manner:

  • return null values for all matching rows when a column is missing (e.g. for groupBy, topN)
  • treat empty string and null the same, but always return null
  • allow specifying null in filters
  • for multi-value dimensions, if the dimension has multiple values:
    1. dim extraction returns null -> null
    2. missing column -> null
    3. [] -> null
    4. [""] -> [null]
    5. ["", "a"] -> [null, "a"]
  • for single value dimensions
    1. missing column -> null
    2. dim extraction returns null -> null
    3. [""] -> null
    4. [] -> null
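The mapping rules above can be sketched as a small helper. This is an illustration of the rules only, not the PR's actual code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class NullNormalization
{
  // Single-value rule: empty string and null are treated the same,
  // and the canonical form returned to callers is null.
  public static String normalizeSingle(String value)
  {
    return (value == null || value.isEmpty()) ? null : value;
  }

  // Multi-value rule: a missing column or empty list collapses to null,
  // while each element is normalized individually, so [""] becomes [null]
  // and ["", "a"] becomes [null, "a"].
  public static List<String> normalizeMulti(List<String> values)
  {
    if (values == null || values.isEmpty()) {
      return null;
    }
    List<String> normalized = new ArrayList<>();
    for (String v : values) {
      normalized.add(normalizeSingle(v));
    }
    return normalized;
  }
}
```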

Also on the PR is the addition of a context parameter on timeseries queries that allows them to ignore empty buckets. I realize that this should've been separated into two PRs, but there is a bit of context behind these commits that makes that difficult. It's just a few changes in TimeseriesQuery, TimeseriesQueryEngine and TimeseriesQueryRunnerTest.
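For reference, query context flags in Druid are plain key/value entries; a minimal sketch of building such a context map, assuming the flag is named `skipEmptyBuckets` (the helper class here is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class SkipEmptyBucketsExample
{
  // Builds a query context map enabling the flag; "skipEmptyBuckets"
  // is the context key assumed for this sketch.
  public static Map<String, Object> contextWithSkip()
  {
    Map<String, Object> context = new HashMap<>();
    context.put("skipEmptyBuckets", true);
    return context;
  }
}
```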

@fjy

fjy commented Dec 24, 2014

@cheddar Welcome back :P

Contributor

why is this only on timeseries query and not other query types?

Contributor Author

I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?

Contributor

What is the behavior of groupBy with no dimensions?

Contributor Author

Only creates an entry for each time bucket that actually exists, iirc

Member

@gianm and I fixed some inconsistencies related to empty time buckets when buckets at query interval boundaries and maxtime don't line up, see #705.
It is still not consistent with groupBy, which confuses users (#701).

Do we need this flag for topN as well?

I believe things would be more consistent overall if we skipped empty buckets by default, since if data is missing for an entire segment granularity, those buckets will be missing anyway, and I don't believe results should depend on segment granularity.

I would be in favor of skipping empty buckets by default in 0.7, but we may want to make that change as part of a separate PR.

Contributor Author

@xvrl re: topN

I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?

And yes, skipping empty buckets by default is what timeseries initially did; then I had it auto-generate empty values as an indirect mechanism of figuring out whether a segment exists or not (i.e. it will generate 0's if the segment exists and there just is nothing there, whereas if the segment doesn't exist then it won't generate anything). This proved insufficient to determine that a segment isn't actually there, though. So, while I agree with you in principle that switching back to the original "never generate empty values" behavior is more correct, the fact is that there might be people who are expecting those values to be generated for them, and making this change in a backwards-incompatible manner could make it very difficult for them to actually move forward.

If we want to make the change to timeseries defaulting to not generating anything, that should be done in a subsequent version. That allows people using the system some time to set this parameter first and rework their systems before changing the default.

Member

Agree, we don't have to make those changes in 0.7, only if we felt strongly about making things more consistent and did not want to wait for 0.8 to make that change.

@fjy

fjy commented Dec 24, 2014

I have a general comment around v9 segments. If I remember, I think we have two entries in the dictionary, one for nulls and the other for empty strings. Even with these changes, I think filtering on empty strings, filtering on nulls, and filtering on empty strings & nulls will return different results.

@cheddar

cheddar commented Dec 24, 2014

Null and the empty string look the exact same in the segments (there is only one dictionary entry), that's why I keep saying that it's really difficult to handle them differently.

@cheddar

cheddar commented Dec 24, 2014

Also, I disagree with your statement that filtering on different things will produce different results. I'm assuming you have a specific case in mind, so if you want to create a unit test for it and push it up, I'll fix it if it's failing. Though, I think it'll pass ;).

@fjy

fjy commented Dec 24, 2014

Okay, let me try and reproduce when I'm back. I remember looking into this a little while ago and the v8 --> v9 conversion could sometimes produce 2 entries. Of course I could just be imagining things :)

@cheddar

cheddar commented Dec 24, 2014

If v8 -> v9 conversion is producing two entries, that's a bug and should be fixed.

Contributor

Class is lacking tests.

Contributor

Wait, I think I found them as part of Timeseries query runner test.

Contributor Author

There are no direct tests, that's true. It is tested through the various queries, but some direct tests would probably also be meaningful.

@drcrallen

Scala tends to treat things as empty rather than null. @cheddar: can you please comment on why you would like to use null to mean empty? (as opposed to having empty mean empty, and storing empty variables instead of null variables)

@cheddar

cheddar commented Dec 24, 2014

@drcrallen I'm not sure what you are asking, which part of the code are you talking about? Or are you just asking a philosophical question about the use of null instead of all the Optional<> stuff that people are using these days?

Contributor

Please make sure there is a test case for the following scenario:

Dim contains only null values; the filter is a NOT filter wrapping a selector for any value on that dimension.

Expected results should be all values.

Note this varies from traditional SQL, where the expected results would be no values.
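The requested semantics can be sketched as follows. The class and methods here are hypothetical, purely to contrast Druid-style two-valued filtering with SQL's three-valued logic, where `NULL <> 'x'` evaluates to UNKNOWN and the row is dropped:

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class NotFilterSemantics
{
  // Druid-style selector: matches when the (null-normalized) row value
  // equals the filter value. Objects.equals treats two nulls as equal.
  public static boolean selectorMatches(String rowValue, String filterValue)
  {
    return Objects.equals(rowValue, filterValue);
  }

  // A NOT filter is plain boolean negation: rows whose value is null
  // simply do not match the selector, so the NOT keeps them.
  public static List<String> notFilter(List<String> rows, String filterValue)
  {
    return rows.stream()
               .filter(v -> !selectorMatches(v, filterValue))
               .collect(Collectors.toList());
  }
}
```

Applying `notFilter` to an all-null column with the value "cheesecake" returns every row, whereas SQL's three-valued logic would return none.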

Contributor Author

@drcrallen

I don't quite understand this. I believe you are saying that if I do a timeseries query that filters on value "xyz" on a dimension that doesn't exist (i.e. only null), the filter should match everything?

That seems to be rather unexpected and against the general semantics that this patch is providing.

Contributor

@cheddar: not a filter to include xyz, but rather, if you have a dimension, let's say "cake", which you never actually have a value for (all null), and you say "give me all events where cake is not cheesecake", then I would expect it to return all events.

This contradicts traditional SQL, where the NOT of a null is a null.

@drcrallen

@cheddar: A higher-level question overall.

But from a conceptual standpoint, null and empty are distinct ideas. I get a bit nervous throwing null around as a valid value because it's confusing whether it means "error" or "empty", and throwing nulls around on purpose is an easy way to accidentally NPE somewhere. Lots of other Druid code uses empty sets instead of null, and I was wondering if you have a compelling thought on why things should map to null instead of to an empty object.

Additionally, as per the SQL example mentioned previously, it helps if we have just one representation that is not confusing to folks coming from other data stores.

@cheddar

cheddar commented Dec 28, 2014

Ok, I've updated everything except for adding tests specific to NullDimensionSelector

Fwiw, I went ahead and tried creating a Column implementation that acts just like a single valued null column. It was pretty hard to integrate and ended up breaking some unit tests because the "null handling" logic can take a lot more into account when it is defining "correct" behavior while creating a blanket "null column" doesn't.

One of the big places that this showed up was in the differences between how metric and dimension columns are treated. With just a null they can be treated "correctly" separately from each other, which is not as simple with a column that returns nulls.

@drcrallen hopefully that resolves the null versus "empty object" thing. Fwiw, you define the semantics of null as "error" or "empty", neither of which are the semantics I assign to it. null means "does not exist" as in, the thing you asked for does not exist. Given the knowledge that the thing you asked for doesn't exist, you can do something meaningful.

On the question of "if you pass nulls, now I have to check for them": if you are working in Java and operating with Objects, you should _always_ be doing null checks. I don't care if it's an Optional<> that you got returned, that reference can still be null and generate an NPE. In fact, null checks are the cheapest conditional you can do in Java, because the JVM expects them to be done so often; the vast majority of the time the check happens when loading the actual reference.

Lastly, I do not understand what you are trying to say with

Additionally, as per the SQL example mentioned previously, it helps if we have just one representation that is not confusing to folks coming from other data stores.

@sharrissf

@cheddar Hi, I haven't looked at the code but from your description it sounded like it might be a case where double dispatch might work well. The idea is that object Foo asks the null object a question with itself as a parameter, and then the null object calls back on Foo for the portions of the info it needs. Trying to picture if that's a good solution here...

@cheddar

cheddar commented Dec 29, 2014

@sharrissf I think that at this point, it's more trouble than it is worth to "fix" the issue. From the initial attempt at changing things a bit, I had to touch a lot more than the 3 places that this touches to handle nulls better. It's definitely a task that we can take on in a long-term fashion, but I don't think the code base is quite ready for it yet.

@drcrallen Thanks for the clarification on the unit test. I added the test that you requested.

Member

For consistency, let's try to move everything to Strings.isNullOrEmpty, which may also be faster, since it does not use .equals.
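For illustration, Guava's `Strings.isNullOrEmpty` behaves like the stand-in below: a plain null-or-length check, with no virtual `.equals` dispatch. (The class here is a hypothetical stand-in so the example is self-contained; the real method lives in `com.google.common.base.Strings`.)

```java
public class NullOrEmpty
{
  // Behaves like com.google.common.base.Strings.isNullOrEmpty:
  // true for null and for the empty string, false otherwise.
  public static boolean isNullOrEmpty(String value)
  {
    return value == null || value.isEmpty();
  }
}
```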

Contributor Author

Missed this one.

@drcrallen

👍 once Travis CI verifies build.

Member

missing newline at end of file (Preferences -> Editor -> General -> Ensure line feed at file end in IntelliJ)

Contributor Author

Is that part of your style? I believe I've always had it set to not have lines at the end of file, 'cause they annoy me ;).

Btw, if we want to enforce some sort of style on the code, let's not do it with nit-picky comments like this, let's do it by publishing style configs.

@xvrl

xvrl commented Dec 29, 2014

Can we add a groupBy and topN test with columns containing null / empty string values along with regular values, to make sure we define the expected behavior?

@cheddar

cheddar commented Dec 29, 2014

@xvrl Tests do exist for null values, can you look at those and ask more specifically for what you think is not covered?

@cheddar

cheddar commented Dec 29, 2014

@xvrl I just pushed something that adds the various "singleton" objects you asked for.

@xvrl

xvrl commented Dec 29, 2014

Re tests with columns containing null values, I meant something like this https://github.com/druid-io/druid/pull/495/files#diff-4fc7397755af8d2c280382f211527e23R1186 where the topN is done over an existing column where some values are null or transformed into nulls.

Same thing for groupBy, have something like this https://github.com/druid-io/druid/pull/495/files#diff-fdc478cff93cb2313327587da0c37157R255

If you want I can add them to your branch directly.

@cheddar

cheddar commented Dec 29, 2014

Ah, ok. Yeah, I don't have tests on a column that mixes nulls with non-nulls. IIRC, I didn't do that because I don't believe we have a test data set that provides that. If you would be willing to provide some tests on the branch (passing or not), that'd be awesome. If they aren't passing, I'll make them pass.

Also, in the diffs you linked, I noticed you were doing a little bit of jiggering to get around ImmutableMap not allowing a null key. You can also just switch to using a normal Map. I.e. instead of

ImmutableMap.of(null, "billy");

you can do

new LinkedHashMap<String, String>(){{ put(null, "billy"); }};

and it will work. This incantation is fine for tests, but it's not great for actual production code (it subclasses LinkedHashMap and adds an instance initializer that runs after the super constructor).
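A less surprising alternative for tests is to build the map explicitly, avoiding the anonymous subclass entirely. This is an illustrative sketch with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

public class NullKeyMap
{
  // HashMap permits a single null key, unlike Guava's ImmutableMap,
  // which throws NullPointerException on null keys or values.
  public static Map<String, String> withNullKey(String value)
  {
    Map<String, String> map = new HashMap<>();
    map.put(null, value);
    return map;
  }
}
```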

@sharrissf

@cheddar cool, yeah, abstraction and factoring are things best timeboxed and improved over time

@cheddar

cheddar commented Dec 30, 2014

@xvrl unit tests fixed

@xvrl

xvrl commented Jan 5, 2015

I did a quick test replacing "news" with the empty string in druid.sample.tsv.* and it doesn't seem to return the expected result in quite a few test cases. Sometimes we get back null, sometimes "". This obviously requires changing most tests, so maybe we'll want to add new ones to cover those cases, but that also requires changing quite a bit of code.

@fjy

fjy commented Jan 27, 2015

@cheddar Apparently #1014 is not fixed by this pull request. I am +1 on merging this pull request as it makes progress on how Druid handles nulls, but there are still numerous problems with handling nulls and empty strings in Druid and I think we should consider fixing them sooner rather than later.

@drcrallen

Can this get squashed/rebased before merging?

@nebrera

nebrera commented Jan 28, 2015

Actually, we are having these problems with null handling returning wrong data when we query or filter these dimensions with null values.

Whenever you merge it, let me know and I will test it in our labs.

Thanks for all,

Pablo Nebrera

@cheddar

cheddar commented Jan 28, 2015

@drcrallen yes, it can, but only when there's actually some indication that the code could get merged. I'm not interested in wasting time making it mergeable again only to have more merge conflicts introduced later down the line because the PR cannot be merged for political reasons.

@xvrl xvrl modified the milestones: 0.7.1, 0.7.0 Jan 29, 2015
@fjy fjy force-pushed the master branch 2 times, most recently from 8b0ec82 to d05032b on February 1, 2015
This commit also includes
1) the addition of a context parameter on timeseries queries that allows them to ignore empty buckets instead of generating results for them
2) A cleanup of an unused method on an interface
xvrl added a commit that referenced this pull request Feb 2, 2015
Towards consistent null handling
@xvrl xvrl merged commit ccebf28 into master Feb 2, 2015
@xvrl xvrl deleted the null-fixes branch February 2, 2015 22:39
@fjy fjy mentioned this pull request Dec 29, 2015


Development

Successfully merging this pull request may close these issues.

More consistent null handling

7 participants