@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to consistently treat empty strings for double and float types as null. It looks like we mistakenly missed this corner case. I guess it is not that serious, since the change in behavior appears to have happened between 1.x and 2.x and it is a rare corner case.

For an easy reproducer, in case of double, the code below raises an error:

spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1.1, "b": 1.1}""").toDS).show()
Caused by: java.lang.RuntimeException: Cannot parse  as double.
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:163)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7$$anonfun$apply$10.applyOrElse(JacksonParser.scala:152)
  at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeConverter$7.apply(JacksonParser.scala:152)
  at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$convertObject(JacksonParser.scala:312)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:71)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1$$anonfun$apply$2.applyOrElse(JacksonParser.scala:70)
  at org.apache.spark.sql.catalyst.json.JacksonParser.org$apache$spark$sql$catalyst$json$JacksonParser$$parseJsonToken(JacksonParser.scala:277)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$makeStructRootConverter$1.apply(JacksonParser.scala:70)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:368)
  at org.apache.spark.sql.catalyst.json.JacksonParser$$anonfun$parse$2.apply(JacksonParser.scala:363)
  at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2491)
  at org.apache.spark.sql.catalyst.json.JacksonParser.parse(JacksonParser.scala:363)
  at org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
  at org.apache.spark.sql.DataFrameReader$$anonfun$5$$anonfun$6.apply(DataFrameReader.scala:450)
  at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
  ... 24 more

Other types, by contrast, yield null for an empty string:

spark.read.option("mode", "FAILFAST").json(Seq("""{"a":"", "b": ""}""", """{"a": 1, "b": 1}""").toDS).show()
+----+----+
|   a|   b|
+----+----+
|null|null|
|   1|   1|
+----+----+
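
The contrast between the two reproducers above can be modeled without Spark. The following is a minimal sketch, not Spark's actual code: the `Token` types, the converters, and `convert` are all hypothetical names. It assumes (the thread does not confirm this) that the double converter handles string tokens itself, for example to accept "NaN", so an empty string never reaches the shared fallback that returns null for the other types.

```scala
// Simplified model of the inconsistency; every name here is hypothetical.
object EmptyStringModel {
  sealed trait Token
  final case class Str(text: String) extends Token
  final case class Num(value: Double) extends Token

  type Converter = PartialFunction[Token, Any]

  // Shared fallback: null for an empty string, an error for anything else.
  val fallback: Converter = {
    case Str("") => null
    case other   => throw new RuntimeException(s"Failed to parse $other")
  }

  // The integer converter only claims numeric tokens, so "" reaches the fallback.
  val intConverter: Converter = {
    case Num(v) => v.toInt
  }

  // Assumption: the double converter claims string tokens itself,
  // so "" is rejected before the fallback is ever consulted.
  val doubleConverter: Converter = {
    case Num(v)     => v
    case Str("NaN") => Double.NaN
    case Str(other) => throw new RuntimeException(s"Cannot parse $other as double.")
  }

  // Apply the type-specific converter, falling back when it does not match.
  def convert(converter: Converter, token: Token): Any =
    converter.applyOrElse(token, fallback)
}
```

In this model, `convert(intConverter, Str(""))` returns null while `convert(doubleConverter, Str(""))` throws, which mirrors the two outputs shown above.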

How was this patch tested?

Unit tests were added, and the change was manually tested.

@HyukjinKwon
Member Author

cc @cloud-fan, @viirya and @fuqiliang

@HyukjinKwon
Member Author

It looks like a few other types could potentially have this issue too. Let me fix them all while I am here.

@HyukjinKwon HyukjinKwon changed the title from "[SPARK-25040][SQL] Empty string for double and float types should be nulls in JSON" to "[WIP][SPARK-25040][SQL] Empty string for double and float types should be nulls in JSON" Aug 7, 2018
@HyukjinKwon
Member Author

Hmm, wait. Let me take a closer look.

@viirya
Member

viirya commented Aug 7, 2018

Empty string should be treated as null for all non string types?

* This function throws an exception for failed conversion, but returns null for empty string,
* to guard the non string types.
*/
private def failedConversion[R >: Null](
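
The snippet quoted above is cut off after the signature. As a rough illustration of what its doc comment describes, here is a standalone sketch. It is not Spark's implementation: the `Token` model and the `targetType` parameter are hypothetical, and only the name and the `R >: Null` bound come from the quoted line. The bound is what allows the function to return null for any nullable result type.

```scala
// Hypothetical, self-contained model of a lenient fallback.
object LenientFallback {
  sealed trait Token
  final case class Str(text: String) extends Token
  case object StartObject extends Token

  // R >: Null means null is a valid value of R, so the empty-string
  // branch can return null regardless of the target result type.
  def failedConversion[R >: Null](targetType: String): PartialFunction[Token, R] = {
    case Str(text) if text.isEmpty =>
      null // empty string: guard the non-string type by producing null
    case token =>
      throw new RuntimeException(
        s"Failed to parse a value for data type $targetType (current token: $token).")
  }
}
```

With this sketch, `LenientFallback.failedConversion[java.lang.Integer]("int")(LenientFallback.Str(""))` returns null, while any other token raises an exception.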

@cloud-fan
Contributor

> Empty string should be treated as null for all non string types?

I would exclude complex types.

@HyukjinKwon
Member Author

I found this:

https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JacksonParser.scala#L69

in branch-1.6, and was wondering whether we should actually just disallow empty strings. Let me take a closer look and push some fixes here, with details, soon.

@SparkQA

SparkQA commented Aug 7, 2018

Test build #94342 has finished for PR 22019 at commit ef57fdd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member Author

What do you guys think about completely disallowing empty strings for other types and targeting 3.0.0? In theory, an empty string is a string, and strictly speaking I think it is more correct to disallow it. That seems to be the original intention of the code.

@mgaido91
Contributor

I agree with this proposal, @HyukjinKwon. I think it is wrong to treat an empty string as null. An empty string is not a valid value for an int/double/..., so if we encounter one, I think we should fail.

@cloud-fan
Contributor

SGTM

@viirya
Member

viirya commented Aug 10, 2018

SGTM too.

@HyukjinKwon
Member Author

Let me reopen a PR and proceed with this after 2.4.0 or the code freeze.

@HyukjinKwon
Member Author

@viirya and @MaxGekk, are you busy? Would you mind taking this over? We will completely disallow empty strings for other types and target 3.0.0. The change shouldn't be large, and it requires updating the migration guide.

I will be busy for a couple of weeks so I would appreciate it if you find some time to take over this.

Otherwise, I will start to work on this after a couple of weeks.

@MaxGekk
Member

MaxGekk commented Oct 12, 2018

I probably will not be able to look at this for the next few weeks.

@viirya
Member

viirya commented Oct 12, 2018

@HyukjinKwon thanks for pinging me. Let me look at this and see if I can make a PR soon.

@HyukjinKwon HyukjinKwon deleted the double-float-empty branch October 16, 2018 12:45
asfgit pushed a commit that referenced this pull request Oct 23, 2018
…owed

## What changes were proposed in this pull request?

This takes over the original PR at #22019. The original proposal was to return null for float and double types. Later, a more reasonable proposal was made: disallow empty strings. This patch adds logic to throw an exception when an empty string is found for a non-string type.

## How was this patch tested?

Added test.

Closes #22787 from viirya/SPARK-25040.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
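
The behavior described in the commit message above, failing on empty strings for non-string types, can be sketched in isolation. This is a hedged model rather than the merged patch: the `DataType` objects and `parseString` are hypothetical stand-ins, and non-empty strings are parsed naively with `toInt`/`toDouble` purely for illustration.

```scala
// Hypothetical model of the strict rule; not the code merged into Spark.
object StrictEmptyString {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType
  case object DoubleType extends DataType

  def parseString(text: String, dataType: DataType): Any = dataType match {
    // An empty string is a perfectly valid string value.
    case StringType => text
    // For every other type, an empty string is rejected instead of becoming null.
    case other if text.isEmpty =>
      throw new RuntimeException(s"Empty string is not allowed for data type $other")
    // Naive parsing of non-empty input, for illustration only.
    case IntegerType => text.toInt
    case DoubleType  => text.toDouble
  }
}
```

Here `parseString("", StringType)` returns the empty string, whereas `parseString("", DoubleType)` and `parseString("", IntegerType)` both throw, so all non-string types behave the same way.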
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019