More ParseException handling for numeric dimensions #5312

gianm merged 5 commits into apache:master
Conversation
a574487 to d1b5431
```java
    .findCounter(HadoopDruidIndexerConfig.IndexJobCounters.INVALID_ROW_COUNTER);
jobStats.setInvalidRowCount(invalidRowCount.getValue());
Counters counters = job.getCounters();
if (counters == null) {
```
Why'd you need to add these? Why would the counters not be set?

It happened on a task failure when I was testing the ParseExceptions; I didn't look further into why the job counters weren't set.
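The exchange above is about guarding against a missing `Counters` object after a failed Hadoop job. A minimal sketch of that defensive pattern, using a simplified stand-in interface rather than Hadoop's actual `Counters` class (the helper name `invalidRowCountOrZero` is illustrative, not from the PR):

```java
// Sketch of the null guard discussed above. Hadoop's Job.getCounters() can
// return null (e.g. after certain task failures), so callers must check
// before dereferencing. "Counters" here is a simplified stand-in interface.
public class CounterGuard {
    /** Stand-in for Hadoop's Counters; only the part this sketch needs. */
    interface Counters {
        long invalidRowCount();
    }

    /** Returns the invalid-row count, or 0 when counters are unavailable. */
    static long invalidRowCountOrZero(Counters counters) {
        if (counters == null) {
            return 0L; // job failed or counters unavailable; nothing to report
        }
        return counters.invalidRowCount();
    }

    public static void main(String[] args) {
        System.out.println(invalidRowCountOrZero(null));     // 0
        System.out.println(invalidRowCountOrZero(() -> 7L)); // 7
    }
}
```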
```java
    return newIndex;
  }
  }
}

public static final byte[] toBytes(
```
Please add @Nullable and a javadoc explaining what it means when this returns null (probably it means there was a parse exception and reportParseExceptions is false? Is it possible for this to return null if reportParseExceptions is true?).

I think this might never return null if you take one of the other suggestions (skip column instead of skip whole row on parse error).
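The requested `@Nullable` contract might look like the sketch below. The method body and names are illustrative, not the actual Druid signature; a local annotation stands in for `javax.annotation.Nullable` to keep the example self-contained:

```java
// Sketch of the @Nullable-plus-javadoc contract requested above. The body is
// a toy serializer, not Druid's InputRowSerde.toBytes.
public class NullableSketch {
    /** Local stand-in for javax.annotation.Nullable. */
    @interface Nullable {}

    /**
     * Serializes a row value.
     *
     * @return the serialized bytes, or null when the value could not be
     *         parsed and reportParseExceptions is false; never null when
     *         reportParseExceptions is true (an exception is thrown instead).
     */
    @Nullable
    static byte[] toBytes(String row, boolean reportParseExceptions) {
        try {
            long v = Long.parseLong(row);
            return java.nio.ByteBuffer.allocate(8).putLong(v).array();
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw e; // strict mode: surface the parse error
            }
            return null; // lenient mode: signal "unparseable" to the caller
        }
    }

    public static void main(String[] args) {
        System.out.println(toBytes("oops", false)); // null
        System.out.println(toBytes("5", true).length);
    }
}
```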
```java
Map<String, IndexSerdeTypeHelper> typeHelperMap = Maps.newHashMap();
for (DimensionSchema dimensionSchema : dimensionsSpec.getDimensions()) {
  IndexSerdeTypeHelper typeHelper;
  switch (dimensionSchema.getValueType()) {
```

Replace the switch with `VALUE_TYPE_HELPER_ARRAY[dimensionSchema.getValueType().ordinal()]`?

I kept the switch statement but dropped the array; it wasn't being used anymore after removing the type ordinals from the serialized form.
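The switch-based dispatch that was kept can be illustrated as follows. The enum and helper names are simplified stand-ins for Druid's `ValueType` and `IndexSerdeTypeHelper` implementations, not the actual classes:

```java
import java.util.*;

// Illustrative version of the switch-based dispatch kept in the PR: one
// serde helper per declared value type, chosen per dimension. Names are
// simplified stand-ins, not Druid's actual types.
public class TypeDispatch {
    enum ValueType { STRING, LONG, FLOAT, DOUBLE }

    static String helperFor(ValueType type) {
        switch (type) {
            case STRING: return "StringTypeHelper";
            case LONG:   return "LongTypeHelper";
            case FLOAT:  return "FloatTypeHelper";
            case DOUBLE: return "DoubleTypeHelper";
            default:
                throw new IllegalArgumentException("unknown type: " + type);
        }
    }

    public static void main(String[] args) {
        // Build the per-dimension helper map, as the loop above does.
        Map<String, String> typeHelperMap = new HashMap<>();
        typeHelperMap.put("price", helperFor(ValueType.DOUBLE));
        System.out.println(typeHelperMap.get("price")); // DoubleTypeHelper
    }
}
```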
```java
{
  ValueType getType();

  byte[] serialize(Object value);
```

`void serialize(DataOutput out, Object value)` would be better - avoid needless byte[] allocations.

Changed to `void serialize(ByteArrayDataOutput out, Object value)`.
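The streaming shape suggested above can be sketched with the standard `java.io.DataOutput`/`DataInput` pair (the PR itself uses Guava's `ByteArrayDataOutput`; the interface and helper below are illustrative, not Druid's actual code):

```java
import java.io.*;

// Sketch of the streaming serde interface suggested above: writing through
// DataOutput avoids allocating an intermediate byte[] per value.
public class StreamingSerde {
    interface TypeHelper<T> {
        void serialize(DataOutput out, Object value) throws IOException;
        T deserialize(DataInput in) throws IOException;
    }

    static final TypeHelper<Long> LONG_HELPER = new TypeHelper<Long>() {
        @Override
        public void serialize(DataOutput out, Object value) throws IOException {
            // Defaulting null to 0 is an assumption for this sketch.
            out.writeLong(value == null ? 0L : ((Number) value).longValue());
        }

        @Override
        public Long deserialize(DataInput in) throws IOException {
            return in.readLong();
        }
    };

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        LONG_HELPER.serialize(new DataOutputStream(buf), 42L);
        long roundTripped = LONG_HELPER.deserialize(
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(roundTripped); // 42
    }
}
```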
```java
byte[] serialize(Object value);

T deserialize(byte[] bytes);
```

`T deserialize(DataInput in)` is better for similar reasons.
```java
byte[] serializedRow = InputRowSerde.toBytes(typeHelperMap, inputRow, combiningAggs, true);

if (serializedRow != null) {
  // reportParseExceptions is true as any unparseable data is already handled by the mapper.
```

This comment is attached to the wrong line. It should go with `toBytes`.
```java
if (reportParseExceptions) {
  throw pe;
} else {
  // discard the row if there was a parse error in a dimension
```

I think it would be better to just use a "default" value for this column, like zero. It's more in line with what aggregators do (see the below log message "Encountered parse error, skipping aggregator"). That way, at least we retain something from the row (namely: all other columns).
```java
InputRowSerde.toBytes(typeHelperMap, inputRow, aggregators, reportParseExceptions);

if (serializedInputRow == null) {
  return;
```

This should increment INVALID_ROW_COUNTER and log a debug message. However, this comment may be moot if you take the suggestion to skip just the column and not the entire row.

Added a counter increment and debug message for now.
```diff
 }

-return DimensionHandlerUtils.convertObjectToLong(dimValues);
+return DimensionHandlerUtils.convertObjectToLong(dimValues, true);
```

If we get an unparseable string value here, and reportParseExceptions is false, I think it'd be better to replace it with zero than to skip the entire row. That way we aren't throwing away as much data.

If reportParseExceptions is true then yes, we should throw an exception out.

(+ similar comment for other numeric types)
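The behavior the reviewer describes (default to zero when lenient, throw when strict) can be sketched as below. This is a hypothetical simplification, not Druid's actual `DimensionHandlerUtils.convertObjectToLong` implementation:

```java
// Hypothetical sketch of the lenient-vs-strict parse behavior discussed
// above: when reportParseExceptions is false, an unparseable string maps to
// zero instead of aborting the whole row.
public class LenientLongParser {
    static long convertObjectToLong(Object val, boolean reportParseExceptions) {
        if (val == null) {
            return 0L;
        }
        if (val instanceof Number) {
            return ((Number) val).longValue();
        }
        try {
            return Long.parseLong(val.toString());
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                // strict mode: surface the parse error to the caller
                throw new IllegalArgumentException("could not convert to long: " + val, e);
            }
            return 0L; // lenient mode: keep the row, default this column
        }
    }

    public static void main(String[] args) {
        System.out.println(convertObjectToLong("123", true));  // 123
        System.out.println(convertObjectToLong("abc", false)); // 0
    }
}
```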
@gianm I addressed your comments aside from the ones regarding rejecting entire rows on parse errors vs. replacing unparseable values with a default value. I went with the row rejection approach since I think ingesting a partially correct row introduces more "corruption" into the data than just throwing the row away. On some datasets, zero could be a significant value for a numeric dimension, and adding the unparseable rows would lead to confusing results. I saw that the metrics handle parse exceptions by skipping the aggregator, but I also lean towards throwing entire rows away in that case (maybe you have a set of metrics that are somewhat related/correlated, like a metric for a total count plus a metric for a related flow rate; if there's a parse error in one of the values, I think it's arguably weird to ingest the row), though I believe that would require a significant refactor. I don't feel super strongly about this, and it's not absolutely clear to me that "partially correct" rows are better or worse than missing rows, so I'll defer to your judgment, or the community's if others wish to comment on this.
@jon-wei my feeling is that if you want total integrity then that is what
a637336 to d3f0cab
@gianm That reasoning sounds good to me. I've updated the patch to replace unparseable numeric dimension values with zeros when reportParseExceptions is false.
I'm reviewing this PR.

jihoonson left a comment
@jon-wei this PR looks good to me overall. I have one additional question. What do you think about supporting configurable default values for unparseable values? Users may use their domain knowledge to distinguish nulls and 0s until #5278. For example, if users are aware that a column can't have a negative value, they can use -1 as the default value for unparseable values.
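The idea jihoonson raises (a per-column sentinel instead of a hard-coded zero) could look like the sketch below. This is hypothetical illustration only; it is not part of this PR or of Druid's API:

```java
// Hypothetical illustration of a configurable per-column default for
// unparseable values (e.g. -1 to distinguish "bad data" from a legitimate 0).
// Not part of this PR; names are invented for the example.
public class ConfigurableDefault {
    static long parseOrDefault(String raw, long defaultOnParseError) {
        try {
            return Long.parseLong(raw);
        } catch (NumberFormatException e) {
            // Caller-supplied sentinel instead of a hard-coded zero.
            return defaultOnParseError;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseOrDefault("12", -1L));   // 12
        System.out.println(parseOrDefault("oops", -1L)); // -1
    }
}
```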
```diff
 public ValueMatcher makeValueMatcher(final BaseDoubleColumnValueSelector selector, final String value)
 {
-  final Double matchVal = DimensionHandlerUtils.convertObjectToDouble(value);
+  final Double matchVal = DimensionHandlerUtils.convertObjectToDouble(value, false);
```

nit: Probably better to add a method `DimensionHandlerUtils.convertObjectToDouble(value)` which internally calls `DimensionHandlerUtils.convertObjectToDouble(value, false)` for convenience.

Added a method without the parseException boolean.
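The convenience-overload pattern the reviewer suggests is simple delegation; a sketch (the parsing body is a simplification, not Druid's actual `DimensionHandlerUtils` code):

```java
// Sketch of the convenience-overload pattern suggested above: the
// single-argument form delegates to the two-argument form with a lenient
// default, so call sites that never want exceptions stay terse.
public class Overloads {
    static Double convertObjectToDouble(Object val) {
        return convertObjectToDouble(val, false);
    }

    static Double convertObjectToDouble(Object val, boolean reportParseExceptions) {
        if (val == null) {
            return null;
        }
        if (val instanceof Number) {
            return ((Number) val).doubleValue();
        }
        try {
            return Double.parseDouble(val.toString());
        } catch (NumberFormatException e) {
            if (reportParseExceptions) {
                throw new IllegalArgumentException("could not convert to double: " + val, e);
            }
            return null; // lenient mode: signal "unparseable"
        }
    }

    public static void main(String[] args) {
        System.out.println(convertObjectToDouble("2.5")); // 2.5
    }
}
```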
I think that'd be a useful feature, but I think it would be better to implement default values for unparseable values in a separate PR (this one is mainly to fix an NPE bug), and to get #5278 in before that patch to avoid introducing more potential conflicts there, since that's a big "Development Blocker" patch that's been open for a while.
@jon-wei ok. That sounds good to me.
* Discard rows with unparseable numeric dimensions
* PR comments
* Don't throw away entire row on parse exception
* PR comments
* Fix import
Fixes #4879

This is a different approach to #4509.

The patch also adjusts InputRowSerde to serialize dimension values with the types specified in the ingestion schema, instead of serializing all values as Strings. ParseException handling for dimensions has been added to InputRowSerde.toBytes().

Regarding null/unparseable value handling: