Conversation
|
@cheddar Welcome back :P |
why is this only on timeseries query and not other query types?
I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?
What is the behavior of groupBy with no dimensions?
Only creates an entry for each time bucket that actually exists, iirc
@gianm and I fixed some inconsistencies related to empty time buckets when buckets at query interval boundaries and maxtime don't line up, see #705.
It is still not consistent with groupBy, which confuses users (#701).
Do we need this flag for topN as well?
I believe things would be more consistent overall if we skipped empty buckets by default, since if data is missing for an entire segment granularity, those buckets will be missing anyway, and I don't believe results should depend on segment granularity.
I would be in favor of skipping empty buckets by default in 0.7, but we may want to make that change as part of a separate PR.
@xvrl re: topN
I don't believe the other queries generate empty data entries for time buckets that don't have data, has that changed?
And yes, skipping empty buckets by default is what timeseries initially did, then I had it auto-generate empty values as an indirect mechanism of figuring out if a segment exists or not (i.e. it will generate 0's if the segment exists and there just is nothing there, where if the segment doesn't exist then it won't generate anything). This proved to be not enough to determine that a segment isn't actually there, though. So, while I agree with you in principle that switching back to the original "never generate empty values" behavior is more correct, the fact is that there might be people who are expecting those values to be generated for them and making this change in a backwards-incompatible manner could make it very difficult for them to actually move forward.
If we want to make the change to timeseries defaulting to not generating anything, that should be done in a subsequent version. That allows people using the system some time to set this parameter first and rework their systems before changing the default.
Agree, we don't have to make those changes in 0.7, only if we felt strongly about making things more consistent and did not want to wait for 0.8 to make that change.
|
I have a general comment around v9 segments. If I remember, I think we have two entries in the dictionary, one for nulls and the other for empty strings. Even with these changes, I think filtering on empty strings, filtering on nulls, and filtering on empty strings & nulls will return different results. |
|
Null and the empty string look the exact same in the segments (there is only one dictionary entry), that's why I keep saying that it's really difficult to handle them differently. |
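A toy illustration (hypothetical, not Druid's actual column code) of why the two become indistinguishable once stored: if the dimension value is normalized before the dictionary lookup, null and the empty string collapse to a single dictionary entry, so nothing downstream can tell them apart.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DictionarySketch {
    private final Map<String, Integer> dictionary = new LinkedHashMap<>();

    // Assigns a dictionary id to a dimension value, normalizing null to ""
    // first, the way the segment format effectively does.
    int encode(String value) {
        String normalized = (value == null) ? "" : value; // null collapses to ""
        return dictionary.computeIfAbsent(normalized, v -> dictionary.size());
    }

    public static void main(String[] args) {
        DictionarySketch dict = new DictionarySketch();
        int a = dict.encode(null);
        int b = dict.encode("");
        System.out.println(a == b); // same dictionary id: the distinction is gone
    }
}
```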
|
Also, I disagree with your statement that filtering on different things will produce different results. I'm assuming you have a specific case in your mind, so if you want to create a unit test for it and push it up, if it's failing I'll fix. Though, I think it'll pass ;). |
|
Okay, let me try and reproduce when I'm back. I remember looking into this a little while ago and the v8 --> v9 conversion could sometimes produce 2 entries. Of course I could just be imagining things :) |
|
If v8 -> v9 conversion is producing two entries, that's a bug and should be fixed. |
Wait, I think I found them as part of Timeseries query runner test.
There are no direct tests, that's true. It is tested through the various queries, but some direct tests would probably also be meaningful.
|
Scala tends to treat things as empty rather than null, @cheddar : can you please comment on why you would like to use null to mean empty? (as opposed to having empty meaning empty, and storing empty variables instead of null variables) |
|
@drcrallen I'm not sure what you are asking, which part of the code are you talking about? Or are you just asking a philosophical question about the use of |
Please make sure there is a test case for the following scenario:
Dim contains only null values, filter is a not filter of a selector of any value for that dimension.
Expected results should be all values.
Note this varies from traditional SQL, where the expected results would be no values.
I don't quite understand this. I believe you are saying that if I do a timeseries query that filters on value "xyz" on a dimension that doesn't exist (i.e. only null), the filter should match everything?
That seems to be rather unexpected and against the general semantics that this patch is providing.
@cheddar : not a filter that includes xyz, but rather if you have a dimension, let's say "cake", which you never actually have a value for (all null), and you say "give me all events where cake is not cheesecake", then I would expect it to return all events.
This contradicts traditional sql, where the not of a null is a null.
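A minimal sketch of the two semantics being contrasted here (the helper names are made up for illustration, not Druid's actual filter classes): the Druid-style "not selector" treats a null row value as simply not equal to the selector value and matches it, while SQL's three-valued logic turns the comparison into UNKNOWN and drops the row.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class NotFilterSemantics {
    // Druid-style: null simply "is not cheesecake", so NOT(selector) matches it.
    static boolean druidNotSelector(String rowValue, String selectorValue) {
        return !Objects.equals(rowValue, selectorValue);
    }

    // SQL-style: any comparison with NULL is UNKNOWN, and NOT UNKNOWN is still
    // UNKNOWN, so the row is filtered out.
    static boolean sqlNotSelector(String rowValue, String selectorValue) {
        if (rowValue == null) {
            return false; // UNKNOWN -> row not returned
        }
        return !rowValue.equals(selectorValue);
    }

    public static void main(String[] args) {
        // A "cake" dimension that never has a value (all null).
        List<String> cakeDim = Arrays.asList(null, null, null);
        long druidMatches = cakeDim.stream()
            .filter(v -> druidNotSelector(v, "cheesecake")).count();
        long sqlMatches = cakeDim.stream()
            .filter(v -> sqlNotSelector(v, "cheesecake")).count();
        System.out.println(druidMatches); // 3 -> all rows
        System.out.println(sqlMatches);   // 0 -> no rows
    }
}
```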
|
@cheddar : Higher question overall. But from a conceptual standpoint null and empty are distinct ideas. I get a bit nervous throwing null around as a valid value because it's confusing whether it means "error" or "empty", and throwing nulls around on purpose is an easy way to accidentally NPE somewhere. Lots of other druid code uses empty sets instead of null, and I was wondering if you have a compelling thought on why things should map to null instead of to an empty object. Additionally, as per the SQL example mentioned previously, it helps if we have just one representation that is not confusing to folks coming from other data stores. |
|
Ok, I've updated everything except for adding tests specific to

Fwiw, I went ahead and tried creating a Column implementation that acts just like a single valued null column. It was pretty hard to integrate and ended up breaking some unit tests, because the "null handling" logic can take a lot more into account when it is defining "correct" behavior, while a blanket "null column" doesn't. One of the big places that this showed up was in the differences between how metric and dimension columns are treated. With just a

@drcrallen hopefully that resolves the

On the question of "if you pass nulls, now I have to check for them": if you are working in Java and you are operating with Objects you should _always_ be doing null checks. I don't care if it's an

Lastly, I do not understand what you are trying to say with
|
@cheddar Hi, I haven't looked at the code but from your description it sounded like it might be a case where double dispatching might work well. The idea is that object Foo askes the null object a question with itself as a parameter and then then Null object calls back on foo for portions of the info it needs. Trying to picture if that's a good solution here... |
|
@sharrissf I think that at this point, it's more trouble than it is worth to "fix" the issue. From the initial attempt at changing things a bit, I had to touch a lot more than the 3 places that this is touching to handling nulls better. It's definitely a task that we can take on in a long-term fashion, but I don't think the code base is quite ready for it yet. @drcrallen Thanks for the clarification on the unit test. I added the test that you requested. |
For consistency, let's try to move everything to Strings.isNullOrEmpty, which may also be faster, since it does not use .equals.
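As a stdlib-only sketch of what the suggested Guava call does (Druid itself would just call Guava's `Strings.isNullOrEmpty` directly, so treat this helper as illustrative):

```java
public class NullOrEmpty {
    // Mirrors Guava's Strings.isNullOrEmpty: checks length rather than
    // calling .equals(""), which is the cheaper comparison alluded to in
    // the review comment.
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(isNullOrEmpty(null));  // true
        System.out.println(isNullOrEmpty(""));    // true
        System.out.println(isNullOrEmpty("abc")); // false
    }
}
```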
|
👍 once Travis CI verifies build. |
missing newline at end of file (Preferences -> Editor -> General -> Ensure line feed at file end in IntelliJ)
Is that part of your style? I believe I've always had it set to not have lines at the end of file, 'cause they annoy me ;).
Btw, if we want to enforce some sort of style on the code, let's not do it with nit-picky comments like this, let's do it by publishing style configs.
|
Can we add a groupBy and topN test with columns containing null / empty string values along with regular values, to make sure we define the expected behavior? |
|
@xvrl Tests do exist for null values, can you look at those and ask more specifically for what you think is not covered? |
|
@xvrl I just pushed something that adds the various "singleton" objects you asked for. |
|
Re tests with columns containing null values, I meant something like this https://github.com/druid-io/druid/pull/495/files#diff-4fc7397755af8d2c280382f211527e23R1186 where the topN is done over an existing column where some values are null or transformed into nulls. Same thing for groupBy, have something like this https://github.com/druid-io/druid/pull/495/files#diff-fdc478cff93cb2313327587da0c37157R255 If you want I can add them to your branch directly. |
|
Ah, ok. Yeah, I don't have tests on a column that mixes nulls with non-nulls. IIRC, I didn't do that because I don't believe we have a test data set that provides that. If you would be willing to provide some tests on the branch (passing or no), that'd be awesome. If they aren't passing, I'll make them pass.

Also, in the diffs you linked, I noticed you were doing a little bit of jiggering to get around ImmutableMap not allowing a "null" key. You can also just switch to using a normal Map. I.e. instead of

You can do

And it will work. This incantation is fine for tests, but it's not great for actual production code (it's sub-classing LinkedHashMap and adding a new constructor after the super constructor). |
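The inline snippets were lost from this comment, but the incantation being described is presumably the "double brace" idiom: an anonymous LinkedHashMap subclass with an instance initializer, which, unlike Guava's ImmutableMap, happily accepts a null key. A hedged sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NullKeyMap {
    // Builds a test fixture containing a null key via the "double brace"
    // idiom. Fine in tests, but avoid in production code: it creates a
    // new anonymous subclass at every use site.
    static Map<String, Long> nullKeyFixture() {
        return new LinkedHashMap<String, Long>()
        {
            {
                put(null, 3L);      // a null key is fine in a LinkedHashMap
                put("spot", 18L);
            }
        };
    }

    public static void main(String[] args) {
        System.out.println(nullKeyFixture().containsKey(null)); // true
    }
}
```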
|
@cheddar cool, yeah, abstraction and factoring are things best timeboxed and improved over time |
|
@xvrl unit tests fixed |
|
I did a quick test replacing |
|
@cheddar Apparently #1014 is not fixed by this pull request. I am +1 on merging this pull request as it makes progress on how Druid handles nulls, but there are still numerous problems with handling nulls and empty strings in Druid and I think we should consider fixing them sooner rather than later. |
|
Can this get squashed/rebased before merging? |
|
Actually we are having these problems with null handlers answering wrong. Whenever you merge it, let me know and I will test it in our labs. Thanks for all. Pablo Nebrera |
|
@drcrallen yes, it can, but only when there's actually some indication that the code could get merged. I'm not interested in wasting time making it mergeable again only to have more merge conflicts introduced later down the line because the PR cannot be merged for political reasons. |
This commit also includes 1) the addition of a context parameter on timeseries queries that allows it to ignore empty buckets instead of generating results for them 2) A cleanup of an unused method on an interface
This PR includes a number of fixes to handle nulls more consistently on the query side. These fixes were done to support a user who is leveraging the schema-less capabilities of Druid. It's sufficient for all needs we've currently found, but I am not certain that it is comprehensive just yet.
Fixes #665 in the following manner
Also, on the PR is the addition of a context parameter on timeseries queries that allows it to ignore empty buckets. I realize that this should've been separated into two PRs, but there is a bit of context behind these commits that makes that difficult. It's just a few changes in TimeseriesQuery, TimeseriesQueryEngine and TimeseriesQueryRunnerTest.
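For illustration, a timeseries query using such a context parameter might look like the following sketch. The `skipEmptyBuckets` name matches the flag later documented in Druid's query context, but the data source, interval, and aggregator here are made up for this example:

```json
{
  "queryType": "timeseries",
  "dataSource": "sample_datasource",
  "granularity": "hour",
  "intervals": ["2015-01-01/2015-01-02"],
  "aggregations": [{ "type": "count", "name": "rows" }],
  "context": { "skipEmptyBuckets": true }
}
```

With the flag set, hours with no data produce no result rows instead of zero-filled entries, matching groupBy's behavior.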