
More ParseException handling for numeric dimensions #5312

Merged
gianm merged 5 commits into apache:master from jon-wei:npe_numeric_fix
Feb 6, 2018
Conversation

jon-wei (Contributor) commented Jan 31, 2018:

Fixes #4879

This is a different approach to #4509

The patch also adjusts InputRowSerde to serialize dimension values with the types specified in the ingestion schema, instead of serializing all values as Strings. ParseException handling for dimensions has been added to InputRowSerde.toBytes().

Regarding null/unparseable value handling:

  • Unparseable strings in numeric dimensions will cause the column value to be replaced with a default zero when reportParseExceptions is false (this null->zero conversion should be removed when the SQL compatible Null Handling Part 1 - Expressions and Storage Changes #5278 series of patches is merged). When reportParseExceptions is true, unparseable strings in a numeric dimension will throw ParseException.
  • The empty string in a numeric dimension is an unparseable value; it is not equivalent to null.
  • An actual null value (if a row explicitly has a null value for a numeric dimension, or if the row is missing that dimension altogether) will still be converted to 0 for now. This should also change after SQL compatible Null Handling Part 1 - Expressions and Storage Changes #5278 allows Druid to store nulls for numeric columns.
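The value-handling policy in these bullets can be sketched as follows. This is a hypothetical helper for illustration, not Druid's actual code; the class and method names are invented:

```java
// Hypothetical sketch of the described policy for a long-typed dimension:
// nulls and missing values become 0 (pre-#5278 behavior), unparseable
// strings (including the empty string) either throw or become 0 depending
// on reportParseExceptions.
public class NumericDimensionPolicy {
    public static long coerceToLong(Object value, boolean reportParseExceptions) {
        if (value == null) {
            return 0L; // explicit null or missing dimension -> 0 for now
        }
        if (value instanceof Number) {
            return ((Number) value).longValue();
        }
        try {
            // The empty string falls through here and is unparseable,
            // not null-equivalent.
            return Long.parseLong(value.toString().trim());
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw new IllegalArgumentException("Unparseable long: " + value, e);
            }
            return 0L; // default zero when reportParseExceptions is false
        }
    }
}
```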

@jon-wei changed the title from "Discard rows with unparseable numeric dimensions" to "[WIP] Discard rows with unparseable numeric dimensions" Jan 31, 2018
@jon-wei changed the title from "[WIP] Discard rows with unparseable numeric dimensions" to "Discard rows with unparseable numeric dimensions" Jan 31, 2018
@jon-wei jon-wei added this to the 0.12.0 milestone Jan 31, 2018
Counters counters = job.getCounters();
if (counters == null) {
...
invalidRowCount = counters.findCounter(HadoopDruidIndexerConfig.IndexJobCounters.INVALID_ROW_COUNTER);
jobStats.setInvalidRowCount(invalidRowCount.getValue());
Contributor:
Why'd you need to add these? Why would the counters not be set?

jon-wei (Contributor Author) commented Feb 1, 2018:
It happened on a task failure when I was testing the ParseExceptions; I didn't look further into why the job counters weren't set.
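The defensive pattern under discussion can be sketched with a plain map standing in for Hadoop's Counters object (hypothetical names, not the real Hadoop API):

```java
import java.util.Map;

// Hypothetical sketch of the null-check being asked about: treat an absent
// counters object (observed here after a task failure) as "no counts"
// instead of dereferencing it and hitting an NPE.
public class CounterGuard {
    public static long invalidRowCountOrZero(Map<String, Long> counters, String counterName) {
        if (counters == null) {
            return 0L; // counters can come back null, e.g. after a task failure
        }
        return counters.getOrDefault(counterName, 0L);
    }
}
```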

return newIndex;
}


Contributor:
Too many newlines here.

Contributor Author:
fixed

}
}

public static final byte[] toBytes(
Contributor:
Please add @Nullable and a javadoc explaining what it means when this returns null (probably it means there was a parse exception and reportParseExceptions is false? Is it possible for this to return null if reportParseExceptions is true?).

Contributor:
I think this might never return null if you take one of the other suggestions (skip column instead of skip whole row on parse error).
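The nullable-return contract being requested can be sketched like this. It is a heavily simplified stand-in for InputRowSerde.toBytes (class and method names here are invented):

```java
import java.nio.ByteBuffer;

// Sketch of the contract discussed above: a null return means "the value was
// unparseable and reportParseExceptions was false". When the flag is true the
// method throws instead, so it never returns null in that mode.
public class RowSerdeSketch {
    /** Returns null only when the value is unparseable and exceptions are suppressed. */
    public static byte[] toBytesOrNull(String value, boolean reportParseExceptions) {
        try {
            long parsed = Long.parseLong(value);
            return ByteBuffer.allocate(Long.BYTES).putLong(parsed).array();
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw e; // surfaced to the caller as a parse error
            }
            return null; // caller treats null as "discard this row"
        }
    }
}
```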

Map<String, IndexSerdeTypeHelper> typeHelperMap = Maps.newHashMap();
for (DimensionSchema dimensionSchema : dimensionsSpec.getDimensions()) {
IndexSerdeTypeHelper typeHelper;
switch (dimensionSchema.getValueType()) {
Contributor:
Replace the switch with VALUE_TYPE_HELPER_ARRAY[dimensionSchema.getValueType().ordinal()] ?

Contributor Author:
I kept the switch statement but dropped the array; it wasn't being used anymore after removing the type ordinals from the serialized form.
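The kept switch can be sketched as below: one serde helper is chosen per declared dimension type. The helper names and the string-valued map are illustrative simplifications, not Druid's actual types:

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch of building a typeHelperMap with a switch over the declared
// value type of each dimension. Names are hypothetical.
public class TypeHelperMapBuilder {
    public enum ValueType { STRING, LONG, FLOAT, DOUBLE }

    public static Map<String, String> build(Map<String, ValueType> dimensionTypes) {
        Map<String, String> typeHelperMap = new HashMap<>();
        for (Map.Entry<String, ValueType> dim : dimensionTypes.entrySet()) {
            final String helper;
            switch (dim.getValue()) {
                case STRING: helper = "stringHelper"; break;
                case LONG:   helper = "longHelper";   break;
                case FLOAT:  helper = "floatHelper";  break;
                case DOUBLE: helper = "doubleHelper"; break;
                default: throw new IllegalArgumentException("Unhandled type: " + dim.getValue());
            }
            typeHelperMap.put(dim.getKey(), helper);
        }
        return typeHelperMap;
    }
}
```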

{
ValueType getType();

byte[] serialize(Object value);
Contributor:
void serialize(DataOutput out, Object value) would be better - avoid needless byte[] allocations.

Contributor Author:
changed to void serialize(ByteArrayDataOutput out, Object value)


byte[] serialize(Object value);

T deserialize(byte[] bytes);
Contributor:
T deserialize(DataInput in) is better for similar reasons.
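The streaming shape suggested in these two review comments can be sketched for a single long value (a minimal standalone example, not the actual interface adopted in the PR):

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch of the reviewer's suggestion: write into a shared DataOutput and
// read from a DataInput, avoiding a fresh byte[] allocation per value.
public class LongStreamSerde {
    public static void serialize(DataOutput out, long value) throws IOException {
        out.writeLong(value); // no intermediate byte[] allocation
    }

    public static long deserialize(DataInput in) throws IOException {
        return in.readLong();
    }
}
```

(The PR ultimately used Guava's ByteArrayDataOutput, which implements DataOutput, so the same idea applies.)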

byte[] serializedRow = InputRowSerde.toBytes(typeHelperMap, inputRow, combiningAggs, true);

if (serializedRow != null) {
// reportParseExceptions is true as any unparseable data is already handled by the mapper.
Contributor:
This comment is attached to the wrong line. It should go with toBytes.

Contributor Author:
Moved comment

if (reportParseExceptions) {
throw pe;
} else {
// discard the row if there was a parse error in a dimension
Contributor:
I think it would be better to just use a "default" value for this column like zero. It's more in line with what aggregators do (see the below log message "Encountered parse error, skipping aggregator"). That way, at least we retain something from the row (namely: all other columns).

InputRowSerde.toBytes(typeHelperMap, inputRow, aggregators, reportParseExceptions);

if (serializedInputRow == null) {
return;
Contributor:
This should increment INVALID_ROW_COUNTER and log a debug message. However, this comment may be moot if you take the suggestion to skip just the column and not the entire row.

Contributor Author:
Added a counter increment and debug message for now
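The counter-plus-debug-message fix can be sketched like this (hypothetical class, with an AtomicLong standing in for the Hadoop counter):

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the fix described above: when serialization returns
// null, bump an invalid-row counter; the real code would also log a debug
// message before skipping the row.
public class InvalidRowStats {
    private final AtomicLong invalidRowCounter = new AtomicLong();

    /** Returns true if the row was usable, false if it was counted as invalid. */
    public boolean recordIfInvalid(byte[] serializedRow) {
        if (serializedRow == null) {
            invalidRowCounter.incrementAndGet(); // stands in for INVALID_ROW_COUNTER
            return false; // real code would also log.debug("Discarded unparseable row")
        }
        return true;
    }

    public long invalidRows() {
        return invalidRowCounter.get();
    }
}
```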

}

return DimensionHandlerUtils.convertObjectToLong(dimValues);
return DimensionHandlerUtils.convertObjectToLong(dimValues, true);
Contributor:
If we get an unparseable string value here, and reportParseExceptions is false, I think it'd be better to replace it with zero than to skip the entire row. That way we aren't throwing away as much data.

If reportParseExceptions is true then yes, we should throw an exception out.

(+ similar comment for other numeric types)

jon-wei (Contributor Author) commented Feb 1, 2018:

@gianm I addressed your comments aside from the ones regarding rejecting entire rows on parse errors vs. replacing unparseable values with a default value.

I went with the row-rejection approach since I think ingesting a partially correct row introduces more "corruption" into the data than just throwing the row away; on some datasets zero could be a significant value for a numeric dimension, and adding the unparseable rows would lead to confusing results.

I saw that the metrics handle parse exceptions by skipping the aggregator, but I also lean towards throwing entire rows away in that case (if you have a set of somewhat related/correlated metrics, say a total count plus a related flow rate, and there's a parse error in one of the values, I think it's arguably weird to ingest the row), though I believe that would require a significant refactor.

I don't feel super strongly about this though, and it's not absolutely clear to me that "partially correct" rows are better/worse than missing rows, so I'll defer to your judgment or the community's if others wish to comment on this.

gianm (Contributor) commented Feb 1, 2018:

@jon-wei my feeling is that if you want total integrity, that is what reportParseExceptions is for: you will know for sure if you get a bad value because ingestion will just stop and wait for you to fix it. If you don't set reportParseExceptions, then I think it makes sense to ingest whatever columns we can. Queries that don't touch this column will be more correct (since all other columns are fine); queries that do touch it will be worse off, but there are more other-columns than this-columns, so I think the balance is in favor of zeroing out the column and keeping the rest of the row.

@jon-wei changed the title from "Discard rows with unparseable numeric dimensions" to "More ParseException handling for numeric dimensions" Feb 1, 2018
jon-wei (Contributor Author) commented Feb 1, 2018:

@gianm That reasoning sounds good to me. I've updated the patch to replace unparseable numeric dimension values with zeros when reportParseExceptions is false.

gianm (Contributor) left a review:
👍 LGTM

jihoonson (Contributor):
I'm reviewing this PR.

jihoonson (Contributor) left a review:
@jon-wei this PR looks good to me overall. I have one additional question: what do you think about supporting configurable default values for unparseable values? Users could use their domain knowledge to distinguish nulls and 0s until #5278. For example, if users are aware that a column can't have a negative value, they can use -1 as the default value for unparseable values.
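The configurable-default idea raised here can be sketched as follows. This is a hypothetical illustration of the suggestion, not something implemented in this PR:

```java
// Hypothetical sketch of a user-configurable default for unparseable input,
// e.g. -1 for a column known to be non-negative, so unparseable values can
// be distinguished from genuine zeros.
public class ConfigurableDefaultParser {
    public static long parseLongOrDefault(String value, long defaultValue) {
        try {
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            return defaultValue;
        }
    }
}
```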

public ValueMatcher makeValueMatcher(final BaseDoubleColumnValueSelector selector, final String value)
{
final Double matchVal = DimensionHandlerUtils.convertObjectToDouble(value);
final Double matchVal = DimensionHandlerUtils.convertObjectToDouble(value, false);
Contributor:
nit: Probably better to add a method DimensionHandlerUtils.convertObjectToDouble(value) which internally calls DimensionHandlerUtils.convertObjectToDouble(value, false) for convenience.

Contributor Author:
Added a method without the parseException boolean
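The convenience-overload pattern adopted here can be sketched as a standalone class (a hypothetical simplification of DimensionHandlerUtils.convertObjectToDouble, not the real implementation):

```java
// Sketch of the overload pattern: the no-flag variant delegates to the
// flagged variant with reportParseExceptions=false, so callers that never
// want exceptions don't have to pass a boolean everywhere.
public class DoubleConverter {
    /** Convenience overload: never throws; unparseable input becomes null. */
    public static Double convertObjectToDouble(Object valObj) {
        return convertObjectToDouble(valObj, false);
    }

    public static Double convertObjectToDouble(Object valObj, boolean reportParseExceptions) {
        if (valObj == null) {
            return null;
        }
        if (valObj instanceof Number) {
            return ((Number) valObj).doubleValue();
        }
        try {
            return Double.parseDouble(valObj.toString());
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw e;
            }
            return null;
        }
    }
}
```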

jon-wei (Contributor Author) commented Feb 2, 2018:

@jihoonson

What do you think about supporting configurable default values for unparseable values?

I think that'd be a useful feature.

I think it would be better to implement default values for unparseable values in a separate PR (this one is mainly to fix an NPE bug), and to get #5278 in before that patch to avoid introducing more potential conflicts, since that's a big "Development Blocker" patch that's been open for a while.

jihoonson (Contributor):
@jon-wei ok. That sounds good to me.

jihoonson (Contributor) left a review:
+1 after travis.

@gianm gianm merged commit 285dedd into apache:master Feb 6, 2018
gianm pushed a commit to gianm/druid that referenced this pull request Feb 6, 2018
* Discard rows with unparseable numeric dimensions

* PR comments

* Don't throw away entire row on parse exception

* PR comments

* Fix import
jon-wei pushed a commit that referenced this pull request Feb 7, 2018
* Discard rows with unparseable numeric dimensions

* PR comments

* Don't throw away entire row on parse exception

* PR comments

* Fix import
jon-wei pushed a commit to implydata/druid-public that referenced this pull request Feb 7, 2018
…ache#5356)

* Discard rows with unparseable numeric dimensions

* PR comments

* Don't throw away entire row on parse exception

* PR comments

* Fix import