More ParseException handling for numeric dimensions #5312

gianm merged 5 commits into apache:master
Conversation
a574487 to d1b5431
```java
    .findCounter(HadoopDruidIndexerConfig.IndexJobCounters.INVALID_ROW_COUNTER);
jobStats.setInvalidRowCount(invalidRowCount.getValue());
Counters counters = job.getCounters();
if (counters == null) {
```
Why'd you need to add these? Why would the counters not be set?

It happened on a task failure when I was testing the ParseExceptions; I didn't look further into why the job counters weren't set.
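The exchange above is about guarding against a missing `Counters` object after a failed Hadoop job. A minimal sketch of that defensive pattern, using a simplified stand-in interface rather than Hadoop's actual `Counters` class (the helper name `invalidRowCountOrZero` is illustrative, not from the PR):

```java
// Sketch of the null guard discussed above. Hadoop's Job.getCounters() can
// return null (e.g. after certain task failures), so callers must check
// before dereferencing. "Counters" here is a simplified stand-in interface.
public class CounterGuard {
    /** Stand-in for Hadoop's Counters; only the part this sketch needs. */
    interface Counters {
        long invalidRowCount();
    }

    /** Returns the invalid-row count, or 0 when counters are unavailable. */
    static long invalidRowCountOrZero(Counters counters) {
        if (counters == null) {
            return 0L; // job failed or counters unavailable; nothing to report
        }
        return counters.invalidRowCount();
    }

    public static void main(String[] args) {
        System.out.println(invalidRowCountOrZero(null));     // 0
        System.out.println(invalidRowCountOrZero(() -> 7L)); // 7
    }
}
```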
```java
    return newIndex;
  }
  }
}

public static final byte[] toBytes(
```
Please add @Nullable and a javadoc explaining what it means when this returns null (probably it means there was a parse exception and reportParseExceptions is false? Is it possible for this to return null if reportParseExceptions is true?).

I think this might never return null if you take one of the other suggestions (skip column instead of skip whole row on parse error).
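The requested `@Nullable` contract might look like the sketch below. The method body and names are illustrative, not the actual Druid signature; a local annotation stands in for `javax.annotation.Nullable` to keep the example self-contained:

```java
// Sketch of the @Nullable-plus-javadoc contract requested above. The body is
// a toy serializer, not Druid's InputRowSerde.toBytes.
public class NullableSketch {
    /** Local stand-in for javax.annotation.Nullable. */
    @interface Nullable {}

    /**
     * Serializes a row value.
     *
     * @return the serialized bytes, or null when the value could not be
     *         parsed and reportParseExceptions is false; never null when
     *         reportParseExceptions is true (an exception is thrown instead).
     */
    @Nullable
    static byte[] toBytes(String row, boolean reportParseExceptions) {
        try {
            long v = Long.parseLong(row);
            return java.nio.ByteBuffer.allocate(8).putLong(v).array();
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw e; // strict mode: surface the parse error
            }
            return null; // lenient mode: signal "unparseable" to the caller
        }
    }

    public static void main(String[] args) {
        System.out.println(toBytes("oops", false)); // null
        System.out.println(toBytes("5", true).length);
    }
}
```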
```java
Map<String, IndexSerdeTypeHelper> typeHelperMap = Maps.newHashMap();
for (DimensionSchema dimensionSchema : dimensionsSpec.getDimensions()) {
  IndexSerdeTypeHelper typeHelper;
  switch (dimensionSchema.getValueType()) {
```

Replace the switch with `VALUE_TYPE_HELPER_ARRAY[dimensionSchema.getValueType().ordinal()]`?

I kept the switch statement but dropped the array; it wasn't being used anymore after removing the type ordinals from the serialized form.
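The switch-based dispatch that was kept can be illustrated as follows. The enum and helper names are simplified stand-ins for Druid's `ValueType` and `IndexSerdeTypeHelper` implementations, not the actual classes:

```java
import java.util.*;

// Illustrative version of the switch-based dispatch kept in the PR: one
// serde helper per declared value type, chosen per dimension. Names are
// simplified stand-ins, not Druid's actual types.
public class TypeDispatch {
    enum ValueType { STRING, LONG, FLOAT, DOUBLE }

    static String helperFor(ValueType type) {
        switch (type) {
            case STRING: return "StringTypeHelper";
            case LONG:   return "LongTypeHelper";
            case FLOAT:  return "FloatTypeHelper";
            case DOUBLE: return "DoubleTypeHelper";
            default:
                throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        // Build the per-dimension helper map, as the loop above does.
        Map<String, String> typeHelperMap = new HashMap<>();
        typeHelperMap.put("price", helperFor(ValueType.DOUBLE));
        System.out.println(typeHelperMap.get("price")); // DoubleTypeHelper
    }
}
```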
```java
{
  ValueType getType();

  byte[] serialize(Object value);
```

`void serialize(DataOutput out, Object value)` would be better - avoid needless byte[] allocations.

Changed to `void serialize(ByteArrayDataOutput out, Object value)`.
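The streaming shape suggested above can be sketched with the standard `java.io.DataOutput`/`DataInput` pair (the PR itself uses Guava's `ByteArrayDataOutput`; the interface and helper below are illustrative, not Druid's actual code):

```java
import java.io.*;

// Sketch of the streaming serde interface suggested above: writing through
// DataOutput avoids allocating an intermediate byte[] per value.
public class StreamingSerde {
    interface TypeHelper<T> {
        void serialize(DataOutput out, Object value) throws IOException;
        T deserialize(DataInput in) throws IOException;
    }

    static final TypeHelper<Long> LONG_HELPER = new TypeHelper<Long>() {
        @Override
        public void serialize(DataOutput out, Object value) throws IOException {
            // Defaulting null to 0 is an assumption for this sketch.
            out.writeLong(value == null ? 0L : ((Number) value).longValue());
        }

        @Override
        public Long deserialize(DataInput in) throws IOException {
            return in.readLong();
        }
    };

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        LONG_HELPER.serialize(new DataOutputStream(buf), 42L);
        long roundTripped = LONG_HELPER.deserialize(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(roundTripped); // 42
    }
}
```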
```java
byte[] serialize(Object value);

T deserialize(byte[] bytes);
```

`T deserialize(DataInput in)` is better for similar reasons.
```java
byte[] serializedRow = InputRowSerde.toBytes(typeHelperMap, inputRow, combiningAggs, true);

if (serializedRow != null) {
  // reportParseExceptions is true as any unparseable data is already handled by the mapper.
```

This comment is attached to the wrong line. It should go with `toBytes`.
```java
if (reportParseExceptions) {
  throw pe;
} else {
  // discard the row if there was a parse error in a dimension
```

I think it would be better to just use a "default" value for this column, like zero. It's more in line with what aggregators do (see the below log message "Encountered parse error, skipping aggregator"). That way, at least we retain something from the row (namely: all other columns).
```java
InputRowSerde.toBytes(typeHelperMap, inputRow, aggregators, reportParseExceptions);

if (serializedInputRow == null) {
  return;
```

This should increment INVALID_ROW_COUNTER and log a debug message. However, this comment may be moot if you take the suggestion to skip just the column and not the entire row.

Added a counter increment and debug message for now.
```diff
 }

-return DimensionHandlerUtils.convertObjectToLong(dimValues);
+return DimensionHandlerUtils.convertObjectToLong(dimValues, true);
```

If we get an unparseable string value here, and reportParseExceptions is false, I think it'd be better to replace it with zero than to skip the entire row. That way we aren't throwing away as much data.

If reportParseExceptions is true then yes, we should throw an exception out.

(+ similar comment for other numeric types)
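The behavior the reviewer describes (default to zero when lenient, throw when strict) can be sketched as below. This is a hypothetical simplification, not Druid's actual `DimensionHandlerUtils.convertObjectToLong` implementation:

```java
// Hypothetical sketch of the lenient-vs-strict parse behavior discussed
// above: when reportParseExceptions is false, an unparseable string maps to
// zero instead of aborting the whole row.
public class LenientLongParser {
    static long convertObjectToLong(Object val, boolean reportParseExceptions) {
        if (val == null) {
            return 0L;
        }
        if (val instanceof Number) {
            return ((Number) val).longValue();
        }
        try {
            return Long.parseLong(val.toString());
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                // strict mode: surface the parse error to the caller
                throw new IllegalArgumentException("could not convert to long: " + val, e);
            }
            return 0L; // lenient mode: keep the row, default this column
        }
    }

    public static void main(String[] args) {
        System.out.println(convertObjectToLong("123", true));  // 123
        System.out.println(convertObjectToLong("abc", false)); // 0
    }
}
```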
@gianm I addressed your comments aside from the ones regarding rejecting entire rows on parse errors vs. replacing unparseable values with a default value. I went with the row rejection approach since I think ingesting a partially correct row introduces more "corruption" into the data than just throwing the row away. On some datasets, zero could be a significant value for a numeric dimension, and adding the unparseable rows would lead to confusing results. I saw that the metrics handle parse exceptions by skipping the aggregator, but I also lean towards throwing entire rows away in that case (maybe you have a set of metrics that are somewhat related/correlated, like a metric for a total count plus a metric for a related flow rate; if there's a parse error in one of the values, I think it's arguably weird to ingest the row), though I believe that would require a significant refactor. I don't feel super strongly about this, and it's not absolutely clear to me that "partially correct" rows are better or worse than missing rows, so I'll defer to your judgment, or the community's if others wish to comment on this.
@jon-wei my feeling is that if you want total integrity then that is what
a637336 to d3f0cab
@gianm That reasoning sounds good to me. I've updated the patch to replace unparseable numeric dimension values with zeros when reportParseExceptions is false.
I'm reviewing this PR.

jihoonson left a comment
@jon-wei this PR looks good to me overall. I have one additional question. What do you think about supporting configurable default values for unparseable values? Users may use their domain knowledge to distinguish nulls and 0s until #5278. For example, if users are aware that a column can't have a negative value, they can use -1 as the default value for unparseable values.
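The idea jihoonson raises (a per-column sentinel instead of a hard-coded zero) could look like the sketch below. This is hypothetical illustration only; it is not part of this PR or of Druid's API:

```java
// Hypothetical illustration of a configurable per-column default for
// unparseable values (e.g. -1 to distinguish "bad data" from a legitimate 0).
// Not part of this PR; names are invented for the example.
public class ConfigurableDefault {
    static long parseOrDefault(String raw, long defaultOnParseError) {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // Caller-supplied sentinel instead of a hard-coded zero.
            return defaultOnParseError;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("12", -1L));   // 12
        System.out.println(parseOrDefault("oops", -1L)); // -1
    }
}
```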
```diff
 public ValueMatcher makeValueMatcher(final BaseDoubleColumnValueSelector selector, final String value)
 {
-  final Double matchVal = DimensionHandlerUtils.convertObjectToDouble(value);
+  final Double matchVal = DimensionHandlerUtils.convertObjectToDouble(value, false);
```

nit: Probably better to add a method `DimensionHandlerUtils.convertObjectToDouble(value)` which internally calls `DimensionHandlerUtils.convertObjectToDouble(value, false)` for convenience.

Added a method without the parseException boolean.
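The convenience-overload pattern the reviewer suggests is simple delegation; a sketch (the parsing body is a simplification, not Druid's actual `DimensionHandlerUtils` code):

```java
// Sketch of the convenience-overload pattern suggested above: the
// single-argument form delegates to the two-argument form with a lenient
// default, so call sites that never want exceptions stay terse.
public class Overloads {
    static Double convertObjectToDouble(Object val) {
        return convertObjectToDouble(val, false);
    }

    static Double convertObjectToDouble(Object val, boolean reportParseExceptions) {
        if (val == null) {
            return null;
        }
        if (val instanceof Number) {
            return ((Number) val).doubleValue();
        }
        try {
            return Double.parseDouble(val.toString());
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw new IllegalArgumentException("could not convert to double: " + val, e);
            }
            return null; // lenient mode: signal "unparseable"
        }
    }

    public static void main(String[] args) {
        System.out.println(convertObjectToDouble("2.5")); // 2.5
    }
}
```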
I think that'd be a useful feature, but I think it would be better to implement default values for unparseable values in a separate PR (this one is mainly to fix an NPE bug), and to get #5278 in before that patch to avoid introducing more potential conflicts there, since that's a big "Development Blocker" patch that's been open for a while.
@jon-wei ok. That sounds good to me.
* Discard rows with unparseable numeric dimensions
* PR comments
* Don't throw away entire row on parse exception
* PR comments
* Fix import
Fixes #4879

This is a different approach to #4509.

The patch also adjusts InputRowSerde to serialize dimension values with the types specified in the ingestion schema, instead of serializing all values as Strings. ParseException handling for dimensions has been added to InputRowSerde.toBytes().

Regarding null/unparseable value handling: