Handle missing values for delimited text files when Nullhandling is enabled #8779
clintropolis merged 10 commits into apache:master
@nishantmonu51 Could you please review this?
? (USE_DEFAULT_VALUE_FOR_NULL
   ? "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=null, met1=6}"
   : "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=, met1=6}")
? "Unparseable timestamp found! Event: {t=bad_timestamp, dim1=foo, dim2=null, met1=6}"
@nishantmonu51 Please let me know if you think there is a better way to do this.
With the CSVParser fix in this PR, an empty value in a CSV file has to be written as "".
One of the test data samples is bad_timestamp,foo,,6, so dim2 would be null regardless of whether null handling is enabled or not.
I think this should be ok. @nishantmonu51 do you have something in mind?
@jihoonson Would you have time to review this?
@a2l007 sorry for the delayed review. I can take a look today. Would you please fix the conflicts?
@jihoonson Oops yeah fixed them now.
jihoonson left a comment
The overall change looks good to me. Left a couple of comments.
"2011-01-13T00:00:00.000Z,product_2,t3\tt4\tt5,u3\tu4",
"2011-01-14T00:00:00.000Z,product_3,t5\tt6\tt7,u1\tu5",
"2011-01-14T00:00:00.000Z,product_4,,u2"
"2011-01-14T00:00:00.000Z,product_4,\"\",u2"
What is this change for?
I think this is to preserve the original test: previously the CSV parser would always produce empty strings here, but after this change it will produce a null in null-compatible mode.
Thanks for clarifying this. I was too late 💤
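The difference between ,, and ,"", discussed in this thread can be sketched with a small hand-written parser. This is only an illustration of the behavior described above, not the opencsv or Druid implementation; the class EmptyVsMissing and the helpers parseLine and finish are hypothetical names, and the sketch assumes comma separators and double-quote quoting only.

```java
import java.util.ArrayList;
import java.util.List;

public class EmptyVsMissing {
    // Hypothetical helper: splits one CSV line, returning null for a field
    // that sits between two bare separators and "" for a quoted empty field.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean sawQuote = false; // true once the current field has opened a quote

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is an escaped quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
                sawQuote = true;
            } else if (c == ',') {
                fields.add(finish(current, sawQuote));
                current.setLength(0);
                sawQuote = false;
            } else {
                current.append(c);
            }
        }
        fields.add(finish(current, sawQuote));
        return fields;
    }

    // A field with no characters and no quotes is treated as missing (null).
    static String finish(StringBuilder field, boolean sawQuote) {
        return (field.length() == 0 && !sawQuote) ? null : field.toString();
    }

    public static void main(String[] args) {
        // Missing value: third field is null.
        System.out.println(parseLine("bad_timestamp,foo,,6"));
        // Explicit empty value: third field is the empty string.
        System.out.println(parseLine("bad_timestamp,foo,\"\",6"));
    }
}
```

Under these assumptions, the first sample yields a null third field and the second yields an empty string, which is why the test data had to switch to \"\" to keep exercising the empty-string case.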
private final RFC4180Parser parser = new RFC4180Parser();
private final RFC4180Parser parser = NullHandling.replaceWithDefault()
    ? new RFC4180Parser()
    : new RFC4180ParserBuilder().withFieldAsNull(
I wonder if this could just be the mode all the time, since I think it wouldn't really change anything for the default mode either: the values would be coerced to '' anyway.
The intention was to preserve backward compatibility, but thinking about it, the empty separators flag might not impact the default mode. I'll do a few tests and raise a PR to make CSVReaderNullFieldIndicator.EMPTY_SEPARATORS the default mode if it goes well.
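The reasoning in this exchange can be checked with a rough sketch: if default mode coerces null to '' as the comment describes, a row parsed with nulls and a row parsed with empty strings become indistinguishable after coercion. The class DefaultModeCoercion and the helpers coerce and coerceAll are hypothetical stand-ins written for illustration; they are not Druid's NullHandling code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DefaultModeCoercion {
    // Hypothetical stand-in for the coercion described in the comment:
    // in default mode a null string is replaced with the empty string.
    static String coerce(String value, boolean replaceWithDefault) {
        return (replaceWithDefault && value == null) ? "" : value;
    }

    static List<String> coerceAll(List<String> row, boolean replaceWithDefault) {
        List<String> out = new ArrayList<>();
        for (String value : row) {
            out.add(coerce(value, replaceWithDefault));
        }
        return out;
    }

    public static void main(String[] args) {
        // The same input line as seen by a parser that emits null for empty
        // separators, and by one that emits the empty string.
        List<String> withNull = Arrays.asList("product_4", null, "u2");
        List<String> withEmpty = Arrays.asList("product_4", "", "u2");

        // Default mode: both rows coerce to the same values.
        System.out.println(coerceAll(withNull, true).equals(coerceAll(withEmpty, true)));   // true

        // Null handling enabled: the distinction survives.
        System.out.println(coerceAll(withNull, false).equals(coerceAll(withEmpty, false))); // false
    }
}
```

If the real coercion behaves like this stand-in, using the empty-separators parser unconditionally would only be observable when null handling is enabled, which is what the proposed follow-up tests would need to confirm.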
…nabled (apache#8779)
* Handle missing values
* Fix multi value tests
* Fix firehose tests
* Fix conflicts
Fixes #8778.
Description
For delimited text file inputs, empty and missing values are parsed identically when -Ddruid.generic.useDefaultValueForNull=false is set. The delimited and CSV parsers are therefore modified to differentiate between missing and empty values.
This PR has: