Skip to content

Handle missing values for delimited text files when Nullhandling is enabled#8779

Merged
clintropolis merged 10 commits intoapache:masterfrom
a2l007:handle_nulls
Nov 20, 2019
Merged

Handle missing values for delimited text files when Nullhandling is enabled#8779
clintropolis merged 10 commits intoapache:masterfrom
a2l007:handle_nulls

Conversation

@a2l007
Copy link
Copy Markdown
Contributor

@a2l007 a2l007 commented Oct 29, 2019

Fixes #8778 .

Description

For delimited text file inputs, empty and missing values are parsed as the same when -Ddruid.generic.useDefaultValueForNull=false is enabled.
The delimited and csv parsers are therefore modified to differentiate between missing and empty values.


This PR has:

  • [x ] been self-reviewed.
  • [x ] added unit tests or modified existing tests to cover new code paths.
  • [x ] been tested in a test Druid cluster.

@a2l007
Copy link
Copy Markdown
Contributor Author

a2l007 commented Oct 31, 2019

@nishantmonu51 Could you please review this?

Copy link
Copy Markdown
Member

@nishantmonu51 nishantmonu51 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, post travis

? (USE_DEFAULT_VALUE_FOR_NULL
? "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=null, met1=6}"
: "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=, met1=6}")
? "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=null, met1=6}"
Copy link
Copy Markdown
Contributor Author

@a2l007 a2l007 Nov 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nishantmonu51 Please let me know if you think there is a better way to do this.
With the CSVParser fix in this PR, the way to specify empty values for csv files would be ""
One of the test data samples is bad_timestamp,foo,,6 and dim2 would be null regardless of whether null handling is enabled or not.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be ok. @nishantmonu51 do you have something in your mind?

@a2l007
Copy link
Copy Markdown
Contributor Author

a2l007 commented Nov 15, 2019

@jihoonson Would you have time to review this?

@jihoonson
Copy link
Copy Markdown
Contributor

@a2l007 sorry for the delayed review. I can take a look today. Would you please fix the conflicts?

@a2l007
Copy link
Copy Markdown
Contributor Author

a2l007 commented Nov 19, 2019

@jihoonson Oops yeah fixed them now.

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall change looks good to me. Left a couple of comments.

"2011-01-13T00:00:00.000Z,product_2,t3\tt4\tt5,u3\tu4",
"2011-01-14T00:00:00.000Z,product_3,t5\tt6\tt7,u1\tu5",
"2011-01-14T00:00:00.000Z,product_4,,u2"
"2011-01-14T00:00:00.000Z,product_4,\"\",u2"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this change for?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is to preserve the original test, since previously csv parser would always produce empty strings here, but after this change in null compatible mode it will produce a null

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying this. I was too late 💤

? (USE_DEFAULT_VALUE_FOR_NULL
? "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=null, met1=6}"
: "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=, met1=6}")
? "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=null, met1=6}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be ok. @nishantmonu51 do you have something in your mind?

Copy link
Copy Markdown
Contributor

@jihoonson jihoonson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@clintropolis clintropolis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

private final RFC4180Parser parser = new RFC4180Parser();
private final RFC4180Parser parser = NullHandling.replaceWithDefault()
? new RFC4180Parser()
: new RFC4180ParserBuilder().withFieldAsNull(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this could just be the all the time mode, since I think it wouldn't really change anything for the default mode either since the values would be coerced to ''.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to preserve backward compatibility but thinking about it, the empty separators flag might not impact the default mode. I'll do a few tests and raise a PR to make CSVReaderNullFieldIndicator.EMPTY_SEPARATORSthe default mode if it goes well.

@clintropolis clintropolis merged commit f5fbd0b into apache:master Nov 20, 2019
jon-wei pushed a commit to jon-wei/druid that referenced this pull request Nov 26, 2019
…nabled (apache#8779)

* Handle missing values

* Fix multi value tests

* Fix firehose tests

* Fix conflicts
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Missing dimensions/metrics are parsed as empty values for delimited text files

5 participants