
Towards consistent null handling #995

Merged
xvrl merged 1 commit into master from null-fixes
Feb 2, 2015

Conversation

@cheddar

@cheddar cheddar commented Dec 24, 2014

This PR includes a number of fixes to handle nulls more consistently on the query side. These fixes were done to support a user who is leveraging the schema-less capabilities of Druid. It's sufficient for all needs we've currently found, but I am not certain that it is comprehensive just yet.

Fixes #665 in the following manner:

  • return null values for all matching rows when a column is missing (e.g. for groupBy, topN)
  • treat empty string and null the same, but always return null
  • allow specifying null in filters
  • for multi-value dimensions, if the dimension has multiple values:
    1. dim extraction returns null -> null
    2. missing column -> null
    3. [] -> null
    4. [""] -> [null]
    5. ["", "a"] -> [null, "a"]
  • for single value dimensions
    1. missing column -> null
    2. dim extraction returns null -> null
    3. [""] -> null
    4. [] -> null
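The mapping rules above can be sketched as a small helper. This is an illustration of the rules only, not the PR's actual code; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

public class NullNormalization
{
  // Single-value rule: empty string and null are treated the same,
  // and the canonical form returned to callers is null.
  public static String normalizeSingle(String value)
  {
    return (value == null || value.isEmpty()) ? null : value;
  }

  // Multi-value rule: a missing column or empty list collapses to null,
  // while each element is normalized individually, so [""] becomes [null]
  // and ["", "a"] becomes [null, "a"].
  public static List<String> normalizeMulti(List<String> values)
  {
    if (values == null || values.isEmpty()) {
      return null;
    }
    List<String> normalized = new ArrayList<>();
    for (String v : values) {
      normalized.add(normalizeSingle(v));
    }
    return normalized;
  }
}
```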

Also on the PR is the addition of a context parameter on timeseries queries that allows them to ignore empty buckets. I realize that this should've been separated into two PRs, but there is a bit of context behind these commits that makes that difficult. It's just a few changes in TimeseriesQuery, TimeseriesQueryEngine and TimeseriesQueryRunnerTest.
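For reference, query context flags in Druid are plain key/value entries; a minimal sketch of building such a context map, assuming the flag is named `skipEmptyBuckets` (the helper class here is hypothetical):

```java
import java.util.HashMap;
import java.util.Map;

public class SkipEmptyBucketsExample
{
  // Builds a query context map enabling the flag; "skipEmptyBuckets"
  // is the context key assumed for this sketch.
  public static Map<String, Object> contextWithSkip()
  {
    Map<String, Object> context = new HashMap<>();
    context.put("skipEmptyBuckets", true);
    return context;
  }
}
```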

@fjy

fjy commented Dec 24, 2014

@cheddar Welcome back :P

Contributor

why is this only on timeseries query and not other query types?

Contributor Author

I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?

Contributor

What is the behavior of groupBy with no dimensions?

Contributor Author

Only creates an entry for each time bucket that actually exists, iirc

Member

@gianm and I fixed some inconsistencies related to empty time buckets when buckets at query interval boundaries and maxtime don't line up, see #705.
It is still not consistent with groupBy, which confuses users (#701).

Do we need this flag for topN as well?

I believe things would be more consistent overall if we skipped empty buckets by default, since if data is missing for an entire segment granularity, those buckets will be missing anyway, and I don't believe results should depend on segment granularity.

I would be in favor of skipping empty buckets by default in 0.7, but we may want to make that change as part of a separate PR.

Contributor Author

@xvrl re: topN

I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?

And yes, skipping empty buckets by default is what timeseries initially did; then I had it auto-generate empty values as an indirect mechanism of figuring out whether a segment exists or not (i.e. it will generate 0's if the segment exists and there just is nothing there, whereas if the segment doesn't exist then it won't generate anything). This proved insufficient to determine that a segment isn't actually there, though. So, while I agree with you in principle that switching back to the original "never generate empty values" behavior is more correct, the fact is that there might be people who are expecting those values to be generated for them, and making this change in a backwards-incompatible manner could make it very difficult for them to actually move forward.

If we want to make the change to timeseries defaulting to not generating anything, that should be done in a subsequent version. That allows people using the system some time to set this parameter first and rework their systems before changing the default.

Member

Agree, we don't have to make those changes in 0.7, only if we felt strongly about making things more consistent and did not want to wait for 0.8 to make that change.

@fjy

fjy commented Dec 24, 2014

I have a general comment around v9 segments. If I remember, I think we have two entries in the dictionary, one for nulls and the other for empty strings. Even with these changes, I think filtering on empty strings, filtering on nulls, and filtering on empty strings & nulls will return different results.

@cheddar

cheddar commented Dec 24, 2014

Null and the empty string look the exact same in the segments (there is only one dictionary entry), that's why I keep saying that it's really difficult to handle them differently.

@cheddar

cheddar commented Dec 24, 2014

Also, I disagree with your statement that filtering on different things will produce different results. I'm assuming you have a specific case in mind, so if you want to create a unit test for it and push it up, I'll fix it if it's failing. Though, I think it'll pass ;).

@fjy

fjy commented Dec 24, 2014

Okay, let me try and reproduce when I'm back. I remember looking into this a little while ago and the v8 --> v9 conversion could sometimes produce 2 entries. Of course I could just be imagining things :)

@cheddar

cheddar commented Dec 24, 2014

If v8 -> v9 conversion is producing two entries, that's a bug and should be fixed.

Contributor

Class is lacking tests.

Contributor

Wait, I think I found them as part of Timeseries query runner test.

Contributor Author

There are no direct tests, that's true. It is tested through the various queries, but some direct tests would probably also be meaningful.

@drcrallen

Scala tends to treat things as empty rather than null. @cheddar: can you please comment on why you would like to use null to mean empty? (as opposed to having empty mean empty, and storing empty variables instead of null variables)

@cheddar

cheddar commented Dec 24, 2014

@drcrallen I'm not sure what you are asking, which part of the code are you talking about? Or are you just asking a philosophical question about the use of null instead of all the Optional<> stuff that people are using these days?

Contributor

Please make sure there is a test case for the following scenario:

Dim contains only null values; the filter is a NOT filter wrapping a selector for any value on that dimension.

Expected results should be all values.

Note this varies from traditional SQL, where the expected results would be no values.
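The requested semantics can be sketched as follows. The class and methods here are hypothetical, purely to contrast Druid-style two-valued filtering with SQL's three-valued logic, where `NULL <> 'x'` evaluates to UNKNOWN and the row is dropped:

```java
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class NotFilterSemantics
{
  // Druid-style selector: matches when the (null-normalized) row value
  // equals the filter value. Objects.equals treats two nulls as equal.
  public static boolean selectorMatches(String rowValue, String filterValue)
  {
    return Objects.equals(rowValue, filterValue);
  }

  // A NOT filter is plain boolean negation: rows whose value is null
  // simply do not match the selector, so the NOT keeps them.
  public static List<String> notFilter(List<String> rows, String filterValue)
  {
    return rows.stream()
               .filter(v -> !selectorMatches(v, filterValue))
               .collect(Collectors.toList());
  }
}
```

Applying `notFilter` to an all-null column with the value "cheesecake" returns every row, whereas SQL's three-valued logic would return none.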

Contributor Author

@drcrallen

I don't quite understand this. I believe you are saying that if I do a timeseries query that filters on value "xyz" on a dimension that doesn't exist (i.e. only null), the filter should match everything?

That seems to be rather unexpected and against the general semantics that this patch is providing.

Contributor

@cheddar: not a filter to include xyz, but rather, if you have a dimension, let's say "cake", which you never actually have a value for (all null), and you say "give me all events where cake is not cheesecake", then I would expect it to return all events.

This contradicts traditional SQL, where the NOT of a null is a null.

@drcrallen

@cheddar: A higher-level question overall.

But from a conceptual standpoint, null and empty are distinct ideas. I get a bit nervous throwing null around as a valid value because it's confusing whether it means "error" or "empty", and throwing nulls around on purpose is an easy way to accidentally NPE somewhere. Lots of other Druid code uses empty sets instead of null, and I was wondering if you have a compelling thought on why things should map to null instead of to an empty object.

Additionally, as per the SQL example mentioned previously, it helps if we have just one representation that is not confusing to folks coming from other data stores.

@cheddar

cheddar commented Dec 28, 2014

Ok, I've updated everything except for adding tests specific to NullDimensionSelector

Fwiw, I went ahead and tried creating a Column implementation that acts just like a single valued null column. It was pretty hard to integrate and ended up breaking some unit tests because the "null handling" logic can take a lot more into account when it is defining "correct" behavior while creating a blanket "null column" doesn't.

One of the big places that this showed up was in the differences between how metric and dimension columns are treated. With just a null they can be treated "correctly" separately from each other, which is not as simple with a column that returns nulls.

@drcrallen hopefully that resolves the null versus "empty object" thing. Fwiw, you define the semantics of null as "error" or "empty", neither of which are the semantics I assign to it. null means "does not exist" as in, the thing you asked for does not exist. Given the knowledge that the thing you asked for doesn't exist, you can do something meaningful.

On the question of "if you pass nulls, now I have to check for them": if you are working in Java and operating with Objects, you should _always_ be doing null checks. I don't care if it's an Optional<> that you got returned, that reference can still be null and generate an NPE. In fact, null checks are the cheapest conditional you can do in Java, because the JVM expects them to be done so often; the vast majority of the time the check happens when loading the actual reference.

Lastly, I do not understand what you are trying to say with

Additionally, as per the SQL example mentioned previously, it helps if we have just one representation that is not confusing to folks coming from other data stores.

@sharrissf

@cheddar Hi, I haven't looked at the code but from your description it sounded like it might be a case where double dispatch might work well. The idea is that object Foo asks the null object a question with itself as a parameter, and then the null object calls back on Foo for the portions of the info it needs. Trying to picture if that's a good solution here...

@cheddar

cheddar commented Dec 29, 2014

@sharrissf I think that at this point, it's more trouble than it is worth to "fix" the issue. From the initial attempt at changing things a bit, I had to touch a lot more than the 3 places that this touches to handle nulls better. It's definitely a task that we can take on in a long-term fashion, but I don't think the code base is quite ready for it yet.

@drcrallen Thanks for the clarification on the unit test. I added the test that you requested.

Member

For consistency, let's try to move everything to Strings.isNullOrEmpty, which may also be faster, since it does not use .equals.
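For illustration, Guava's `Strings.isNullOrEmpty` behaves like the stand-in below: a plain null-or-length check, with no virtual `.equals` dispatch. (The class here is a hypothetical stand-in so the example is self-contained; the real method lives in `com.google.common.base.Strings`.)

```java
public class NullOrEmpty
{
  // Behaves like com.google.common.base.Strings.isNullOrEmpty:
  // true for null and for the empty string, false otherwise.
  public static boolean isNullOrEmpty(String value)
  {
    return value == null || value.isEmpty();
  }
}
```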

Contributor Author

Missed this one.

@drcrallen

👍 once Travis CI verifies build.

Member

missing newline at end of file (Preferences -> Editor -> General -> Ensure line feed at file end in IntelliJ)

Contributor Author

Is that part of your style? I believe I've always had it set to not have lines at the end of file, 'cause they annoy me ;).

Btw, if we want to enforce some sort of style on the code, let's not do it with nit-picky comments like this, let's do it by publishing style configs.

@xvrl

xvrl commented Dec 29, 2014

Can we add a groupBy and topN test with columns containing null / empty string values along with regular values, to make sure we define the expected behavior?

@cheddar

cheddar commented Dec 29, 2014

@xvrl Tests do exist for null values, can you look at those and ask more specifically for what you think is not covered?

@cheddar

cheddar commented Dec 29, 2014

@xvrl I just pushed something that adds the various "singleton" objects you asked for.

@xvrl

xvrl commented Dec 29, 2014

Re tests with columns containing null values, I meant something like this https://github.com/druid-io/druid/pull/495/files#diff-4fc7397755af8d2c280382f211527e23R1186 where the topN is done over an existing column where some values are null or transformed into nulls.

Same thing for groupBy, have something like this https://github.com/druid-io/druid/pull/495/files#diff-fdc478cff93cb2313327587da0c37157R255

If you want I can add them to your branch directly.

@cheddar

cheddar commented Dec 29, 2014

Ah, ok. Yeah, I don't have tests on a column that mixes nulls with non-nulls. IIRC, I didn't do that because I don't believe we have a test data set that provides that. If you would be willing to provide some tests on the branch (passing or not), that'd be awesome. If they aren't passing, I'll make them pass.

Also, in the diffs you linked, I noticed you were doing a little bit of jiggering to get around ImmutableMap not allowing a null key. You can also just switch to using a normal Map. I.e. instead of

ImmutableMap.of(null, "billy");

you can do

new LinkedHashMap<String, String>(){{ put(null, "billy"); }};

and it will work. This incantation is fine for tests, but it's not great for actual production code (it subclasses LinkedHashMap and adds an instance initializer that runs after the super constructor).
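A less surprising alternative for tests is to build the map explicitly, avoiding the anonymous subclass entirely. This is an illustrative sketch with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

public class NullKeyMap
{
  // HashMap permits a single null key, unlike Guava's ImmutableMap,
  // which throws NullPointerException on null keys or values.
  public static Map<String, String> withNullKey(String value)
  {
    Map<String, String> map = new HashMap<>();
    map.put(null, value);
    return map;
  }
}
```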

@sharrissf

@cheddar cool, yeah, abstraction and factoring are things best timeboxed and improved over time

@cheddar

cheddar commented Dec 30, 2014

@xvrl unit tests fixed

@xvrl

xvrl commented Jan 5, 2015

I did a quick test replacing "news" with the empty string in druid.sample.tsv.* and it doesn't seem to return the expected result in quite a few test cases. Sometimes we get back null, sometimes "". This obviously requires changing most tests, so maybe we'll want to add new ones to cover those cases, but that also requires changing quite a bit of code.

@fjy

fjy commented Jan 27, 2015

@cheddar Apparently #1014 is not fixed by this pull request. I am +1 on merging this pull request as it makes progress on how Druid handles nulls, but there are still numerous problems with handling nulls and empty strings in Druid and I think we should consider fixing them sooner rather than later.

@drcrallen

Can this get squashed/rebased before merging?

@nebrera

nebrera commented Jan 28, 2015

Actually, we are having these problems with null handling returning wrong data when we query or filter these dimensions with null values.

Whenever you merge it, let me know and I will test it in our labs.

Thanks for all,

Pablo Nebrera

@cheddar

cheddar commented Jan 28, 2015

@drcrallen yes, it can, but only when there's actually some indication that the code could get merged. I'm not interested in wasting time making it mergeable again only to have more merge conflicts introduced later down the line because the PR cannot be merged for political reasons.

@xvrl xvrl modified the milestones: 0.7.1, 0.7.0 Jan 29, 2015
@fjy fjy force-pushed the master branch 2 times, most recently from 8b0ec82 to d05032b on February 1, 2015
This commit also includes
1) the addition of a context parameter on timeseries queries that allows them to ignore empty buckets instead of generating results for them
2) A cleanup of an unused method on an interface
xvrl added a commit that referenced this pull request Feb 2, 2015
Towards consistent null handling
@xvrl xvrl merged commit ccebf28 into master Feb 2, 2015
@xvrl xvrl deleted the null-fixes branch February 2, 2015 22:39
@fjy fjy mentioned this pull request Dec 29, 2015


Development

Successfully merging this pull request may close these issues.

More consistent null handling

7 participants