[SPARK-22165][SQL] Fixes type conflicts between double, long, decimals, dates and timestamps in partition column #19389
Conversation
cc @cloud-fan (I believe my similar PR was reviewed by you before), @ueshin and @squito.

Test build #82307 has finished for PR 19389 at commit

retest this please

Test build #82310 has finished for PR 19389 at commit
squito
left a comment
lgtm, just a small comment on scoping, though I'm not an expert on this area. Left a couple of questions just for my own understanding.
Also, I'm curious why the tests are all going into ParquetPartitionDiscoverySuite -- this doesn't seem specific to parquet, and in fact I wonder if it will be different in parquet since the non-partition columns have schemas specified in the data. I'm just surprised this isn't tested across more formats.
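For illustration, a cross-format version of such a test might look like the sketch below. The suite helpers (`withTempPath`, `testImplicits`, `spark`) and the format list are assumptions borrowed from how Spark SQL test suites are usually structured, not code from this PR:

```scala
import org.apache.spark.sql.types.TimestampType

// Sketch only: run the same partition-type-conflict check against several sources.
Seq("parquet", "orc", "json", "csv").foreach { format =>
  test(s"SPARK-22165: partition value type conflicts - $format") {
    withTempPath { dir =>
      import testImplicits._
      // One value infers to DateType, the other to TimestampType.
      Seq(("2014-01-01", 1), ("2016-01-01 00:01:00", 2)).toDF("part", "id")
        .write.format(format).partitionBy("part").save(dir.getCanonicalPath)
      val inferred =
        spark.read.format(format).load(dir.getCanonicalPath).schema("part").dataType
      // With this PR, DateType and TimestampType should widen to TimestampType.
      assert(inferred === TimestampType)
    }
  }
}
```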
just for my own understanding -- these added asserts are just to improve failure msgs, right? They would have all been covered by the assert(actualSpec === spec) below anyway, right?
Yes, it is. It was difficult for me to find which one was different.
this passes even before the change, right?
as I mentioned on the other PR, I don't really understand why this works despite the issue you're fixing. Regardless, seems like a good test to have.
I think 2016-01-01 00:01:00 happened to be placed first in literals when calling resolveTypeConflicts. Let me try to make this test case not depend on that ordering.
surprisingly, TypeCoercion is totally public, so this should probably be private[sql].
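For reference, the suggested scoping would look roughly like this (a sketch of the object declaration only; the body is elided):

```scala
// Narrow the visibility of the coercion rules to the sql packages instead of
// leaving the object fully public.
private[sql] object TypeCoercion {
  // ... existing rules and helpers, including findWiderCommonType ...
}
```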
Actually, these were removed in SPARK-16813 and #14418 :).
Yea, I agree, since this problem is not specific to Parquet. Such changes and test cases have been added to this file so far, and I simply decided to follow that rather than restructuring or moving the test cases here, partly for the ease of backporting (we should backport this into branch-2.2 and 2.1) and partly to reduce the reviewing cost.
Test build #82345 has finished for PR 19389 at commit
This PR introduces behavior changes. We cannot do this.
Please ensure no behavior change is introduced when fixing such issues. Also cc @cloud-fan

@gatorsmile, could you elaborate which behaviour changes you mean?

Do you mean the before/after in the PR description? They are bugs to fix, aren't they?

ping @gatorsmile

ping?

@cloud-fan, could you take a look when you have some time please?

Will review it this weekend.

Thank you so much @gatorsmile.

Hi @gatorsmile, could you please review this when you have some time?
Generally, the current type inference/coercion rules are messy and random, and we need to seriously revisit our type coercion. After thinking about it more, I think the change in this PR is pretty risky. It introduces new type inference behaviors, although I do not like the previous ones either. These changes could easily cause new regressions when our users upgrade their Spark versions. To make the migration smoother, my general proposal is to introduce a conf for each change like this, if we believe it is a bug fix, and then remove or deprecate the internal conf in the next release (or after a few releases) if nobody raises an issue after a major release (around half a year).
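For illustration, such an escape hatch might look like the sketch below; the config key, description, and default are hypothetical, not something proposed in this PR:

```scala
// Hypothetical internal legacy flag guarding the new partition-type inference behavior,
// following the usual SQLConf builder pattern.
val PARTITION_COLUMN_TYPE_INFERENCE_LEGACY =
  buildConf("spark.sql.sources.partitionColumnTypeInference.legacy")
    .internal()
    .doc("When true, fall back to the pre-2.3 partition column type conflict " +
      "resolution instead of TypeCoercion.findWiderCommonType.")
    .booleanConf
    .createWithDefault(false)
```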
Also cc @rxin @cloud-fan @sameeragarwal
What do you think about opening a discussion on the mailing list? If I understood correctly, some committers have different opinions on this (did I understand correctly?). That should avoid duplicating the discussion, and I am also willing to join it actively.
I think it's a bug: the previous code didn't consider the case when input literal types are outside of the upCastingOrder, and just picked the first type as the final type.
However I'm not sure what's the expected behavior. We need to figure out what's the possible data types for partition columns, and how to merge them.
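Concretely, the old rule resolves such ties by position, because Seq.indexOf returns -1 for any type not listed in upCastingOrder and maxBy keeps the first maximal element it sees; a small sketch:

```scala
import org.apache.spark.sql.types._

val upCastingOrder: Seq[DataType] =
  Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)

// Both types index to -1, so whichever literal happens to come first "wins".
Seq(TimestampType, DateType).maxBy(upCastingOrder.indexOf(_))  // TimestampType
Seq(DateType, TimestampType).maxBy(upCastingOrder.indexOf(_))  // DateType
```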
Let me try to describe what this PR explicitly changes soon in terms of expected input types and merged types.
I think we will have the following input types for this resolveTypeConflicts:

    Seq(
      NullType, IntegerType, LongType, DoubleType,
      DecimalType(...)*, DateType, TimestampType, StringType)

*DecimalType only when it's bigger than LongType:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, lines 384 to 393 in 04975a6:

    val decimalTry = Try {
      // `BigDecimal` conversion can fail when the `field` is not a form of number.
      val bigDecimal = new JBigDecimal(raw)
      // It reduces the cases for decimals by disallowing values having scale (eg. `1.1`).
      require(bigDecimal.scale <= 0)
      // `DecimalType` conversion can fail when
      // 1. The precision is bigger than 38.
      // 2. scale is bigger than precision.
      Literal(bigDecimal)
    }
Because this particular resolveTypeConflicts seems to be called only through:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, line 142 in 04975a6:

    val resolvedPartitionValues = resolvePartitions(pathsWithPartitionValues, timeZone)

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, line 337 in 04975a6:

    resolveTypeConflicts(values.map(_.literals(i)), timeZone)

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, line 474 in 04975a6:

    private def resolveTypeConflicts(literals: Seq[Literal], timeZone: TimeZone): Seq[Literal] = {
In the first call, I see that pathsWithPartitionValues is constructed from partitionValues, which is the output of parsePartition:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, line 108 in 04975a6:

    parsePartition(path, typeInference, basePaths, timeZone)

which parses the input by parsePartitionColumn:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, line 209 in 04975a6:

    parsePartitionColumn(currentPath.getName, typeInference, timeZone)

which calls this inferPartitionColumnValue:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala, lines 384 to 428 in 04975a6:
    val decimalTry = Try {
      // `BigDecimal` conversion can fail when the `field` is not a form of number.
      val bigDecimal = new JBigDecimal(raw)
      // It reduces the cases for decimals by disallowing values having scale (eg. `1.1`).
      require(bigDecimal.scale <= 0)
      // `DecimalType` conversion can fail when
      // 1. The precision is bigger than 38.
      // 2. scale is bigger than precision.
      Literal(bigDecimal)
    }
    if (typeInference) {
      // First tries integral types
      Try(Literal.create(Integer.parseInt(raw), IntegerType))
        .orElse(Try(Literal.create(JLong.parseLong(raw), LongType)))
        .orElse(decimalTry)
        // Then falls back to fractional types
        .orElse(Try(Literal.create(JDouble.parseDouble(raw), DoubleType)))
        // Then falls back to date/timestamp types
        .orElse(Try(
          Literal.create(
            DateTimeUtils.getThreadLocalTimestampFormat(timeZone)
              .parse(unescapePathName(raw)).getTime * 1000L,
            TimestampType)))
        .orElse(Try(
          Literal.create(
            DateTimeUtils.millisToDays(
              DateTimeUtils.getThreadLocalDateFormat.parse(raw).getTime),
            DateType)))
        // Then falls back to string
        .getOrElse {
          if (raw == DEFAULT_PARTITION_NAME) {
            Literal.create(null, NullType)
          } else {
            Literal.create(unescapePathName(raw), StringType)
          }
        }
    } else {
      if (raw == DEFAULT_PARTITION_NAME) {
        Literal.create(null, NullType)
      } else {
        Literal.create(unescapePathName(raw), StringType)
      }
    }
  }
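To illustrate the fallback chain, here is a standalone sketch that mirrors the numeric part of the logic above; it does not call the private helper, and the date/timestamp/null branches are elided:

```scala
import scala.util.Try

// Rough mirror of the integer -> long -> decimal -> double fallback for a raw path value.
def sketchInfer(raw: String): String =
  Try { Integer.parseInt(raw); "IntegerType" }
    .orElse(Try { java.lang.Long.parseLong(raw); "LongType" })
    .orElse(Try {
      val d = new java.math.BigDecimal(raw); require(d.scale <= 0); "DecimalType"
    })
    .orElse(Try { java.lang.Double.parseDouble(raw); "DoubleType" })
    .getOrElse("StringType (or date/timestamp/null in the real code)")

sketchInfer("10")                     // IntegerType
sketchInfer("100000000000000000000")  // DecimalType (too big for Long, scale 0)
sketchInfer("1.1")                    // DoubleType  (decimal rejected: scale > 0)
sketchInfer("hello")                  // falls through to the string case
```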
So, for the types:

    Seq(
      NullType, IntegerType, LongType, DoubleType,
      DecimalType(...), DateType, TimestampType, StringType)

I produced a chart as below with this code:

    test("Print out chart") {
      val supportedTypes: Seq[DataType] = Seq(
        NullType, IntegerType, LongType, DoubleType,
        DecimalType(38, 0), DateType, TimestampType, StringType)

      val combinations = for {
        t1 <- supportedTypes
        t2 <- supportedTypes
      } yield Seq(t1, t2)

      // Old type conflict resolution:
      val upCastingOrder: Seq[DataType] =
        Seq(NullType, IntegerType, LongType, FloatType, DoubleType, StringType)
      def oldResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
        val topType = dataTypes.maxBy(upCastingOrder.indexOf(_))
        if (topType == NullType) StringType else topType
      }

      // New type conflict resolution:
      def newResolveTypeConflicts(dataTypes: Seq[DataType]): DataType = {
        TypeCoercion.findWiderCommonType(dataTypes) match {
          case Some(NullType) => StringType
          case Some(dt: DataType) => dt
          case _ => StringType
        }
      }

      println("|Input types|Old output type|New output type|")
      println("|-----------|---------------|---------------|")
      combinations.foreach { pair =>
        val oldType = oldResolveTypeConflicts(pair)
        val newType = newResolveTypeConflicts(pair)
        if (oldType != newType) {
          println(s"|[`${pair(0)}`, `${pair(1)}`]|`$oldType`|`$newType`|")
        }
      }
    }

So it looks like this PR changes the type resolution as below:
| Input types | Old output type | New output type |
|---|---|---|
| [NullType, DecimalType(38,0)] | StringType | DecimalType(38,0) |
| [NullType, DateType] | StringType | DateType |
| [NullType, TimestampType] | StringType | TimestampType |
| [IntegerType, DecimalType(38,0)] | IntegerType | DecimalType(38,0) |
| [IntegerType, DateType] | IntegerType | StringType |
| [IntegerType, TimestampType] | IntegerType | StringType |
| [LongType, DecimalType(38,0)] | LongType | DecimalType(38,0) |
| [LongType, DateType] | LongType | StringType |
| [LongType, TimestampType] | LongType | StringType |
| [DoubleType, DateType] | DoubleType | StringType |
| [DoubleType, TimestampType] | DoubleType | StringType |
| [DecimalType(38,0), NullType] | StringType | DecimalType(38,0) |
| [DecimalType(38,0), IntegerType] | IntegerType | DecimalType(38,0) |
| [DecimalType(38,0), LongType] | LongType | DecimalType(38,0) |
| [DecimalType(38,0), DateType] | DecimalType(38,0) | StringType |
| [DecimalType(38,0), TimestampType] | DecimalType(38,0) | StringType |
| [DateType, NullType] | StringType | DateType |
| [DateType, IntegerType] | IntegerType | StringType |
| [DateType, LongType] | LongType | StringType |
| [DateType, DoubleType] | DoubleType | StringType |
| [DateType, DecimalType(38,0)] | DateType | StringType |
| [DateType, TimestampType] | DateType | TimestampType |
| [TimestampType, NullType] | StringType | TimestampType |
| [TimestampType, IntegerType] | IntegerType | StringType |
| [TimestampType, LongType] | LongType | StringType |
| [TimestampType, DoubleType] | DoubleType | StringType |
| [TimestampType, DecimalType(38,0)] | TimestampType | StringType |
I think the new behavior is much better. It seems like the previous behavior was just wrong, but it's very rare to see different data types in partition columns, which is why no users have opened tickets for it yet.
cc @gatorsmile
Do we really wanna document the previous wrong behavior in the doc? An example is merging TimestampType and DateType: the result is non-deterministic, depending on which partition path gets parsed first. How do we document that?
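A minimal reproduction sketch of that non-determinism (the path and column names are hypothetical; it assumes a SparkSession `spark` and `import spark.implicits._`):

```scala
// The two partition values below infer to DateType and TimestampType respectively.
Seq(("2016-01-01", 1), ("2016-01-01 00:01:00", 2)).toDF("p", "v")
  .write.partitionBy("p").parquet("/tmp/SPARK-22165-repro")

spark.read.parquet("/tmp/SPARK-22165-repro").printSchema()
// Before this PR: `p` may come out as DateType or TimestampType, depending on
// which partition directory happens to be parsed first.
// After this PR:  `p` is always TimestampType, the wider of the two.
```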
BTW, this is only used in partition discovery; I can't think of a problematic case for it. The only thing it can break is this: users give Spark a data directory for partition discovery and make an assumption about the inferred partition schema, but then we can argue, why did users ask Spark to do partition discovery in the first place?
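As a side note, users who do not want any inference for partition columns can already turn it off via the existing `spark.sql.sources.partitionColumnTypeInference.enabled` flag, in which case every partition column is read as a string (the path below is hypothetical):

```scala
// With partition column type inference disabled, partition columns are always
// StringType, so the merging rules discussed in this PR never kick in.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet("/path/to/partitioned/table").printSchema()
```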
Both the values and the schemas could be changed. External applications might break if the schema is different.
The new behaviors are consistent with what we do for the other type coercion cases. However, our implicit type casting and partition discovery rules are unstable. Other mature systems have clear, stable rules for this; below is an example: https://docs.oracle.com/cloud/latest/db112/SQLRF/sql_elements002.htm#g195937
If each release introduces new behaviors, it becomes hard to use for end users who expect stability. Thus, my suggestion is to first stabilize our type coercion rules before addressing this.
#18853 is the first PR attempt in this direction.
Partitioned columns are different from normal type coercion cases: they are literally all string type, and we are just trying to find the most reasonable type for them.
The previous behavior has been there since the very beginning, and I don't think it went through a decent discussion. This is the first time we seriously design the type merging logic for partition discovery. I think it doesn't need to be blocked by the type coercion stabilization work, as they can diverge.
@HyukjinKwon can you send the proposal to the dev list? I think we need more feedback, e.g. people may want stricter rules and more cases to fall back to string.
Let me send the proposal to the dev list tonight (KST).
52d0cc8 to
a1f1c3a
I have just made a table to check the diff easily. Before:

After:
a1f1c3a to
07bcf36
Test build #84049 has finished for PR 19389 at commit

Test build #84051 has finished for PR 19389 at commit

retest this please

Test build #84059 has finished for PR 19389 at commit
    * |DateType         |DateType         |StringType       |StringType       |StringType       |StringType|DateType     |TimestampType|StringType|
    * |TimestampType    |TimestampType    |StringType       |StringType       |StringType       |StringType|TimestampType|TimestampType|StringType|
    * |StringType       |StringType       |StringType       |StringType       |StringType       |StringType|StringType   |StringType   |StringType|
    * +-----------------+-----------------+-----------------+-----------------+-----------------+----------+-------------+-------------+----------+
we should also put this table in sql-programming-guide and sql migration guide.
      if (topType == NullType) StringType else topType
    }
    val litTypes = literals.map(_.dataType)
    val desiredType = litTypes.fold[DataType](NullType)(findWiderTypeForPartitionColumn)
nit: I think we can use reduce? literals should not be empty.
Yup. I just addressed this one and the one above.
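For reference, the nit amounts to the following (a sketch; the exact signature of `findWiderTypeForPartitionColumn` is assumed from the fold call above):

```scala
// Current shape: fold seeded with NullType.
//   val desiredType = litTypes.fold[DataType](NullType)(findWiderTypeForPartitionColumn)
// Suggested shape: `literals` (and hence `litTypes`) is never empty, so reduce needs no seed.
val desiredType = litTypes.reduce(findWiderTypeForPartitionColumn)
```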
- Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
- The `percentile_approx` function previously accepted numeric type input and output double type results. Now it supports date type, timestamp type and numeric types as input types. The result type is also changed to be the same as the input type, which is more reasonable for percentiles.
- Partition column inference previously found incorrect common type for different inferred types, for example, previously it ended up with double type as the common type for double type and date type. Now it finds the correct common type for such conflicts. The conflict resolution follows the table below:
    * | NullType          | NullType          | IntegerType       | LongType          | DecimalType(38,0) | DoubleType | DateType   | TimestampType | StringType |
    * | IntegerType       | IntegerType       | IntegerType       | LongType          | DecimalType(38,0) | DoubleType | StringType | StringType    | StringType |
    * | LongType          | LongType          | LongType          | LongType          | DecimalType(38,0) | StringType | StringType | StringType    | StringType |
    * | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType | StringType | StringType    | StringType |
It might be good to explain why we can only see DecimalType(38, 0).
Sure.
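For instance, following the decimalTry logic quoted earlier, only values with no fractional part that are too large for LongType reach the decimal branch, which is presumably why only scale-0 decimals (shown as DecimalType(38,0), Spark's maximum decimal precision) appear in the table; a sketch:

```scala
import java.math.{BigDecimal => JBigDecimal}

// "1.1" has scale 1, so `require(bigDecimal.scale <= 0)` fails and the value
// falls through to DoubleType instead of a decimal.
new JBigDecimal("1.1").scale                    // 1
// A value too large for LongType but with no fractional part keeps scale 0,
// so this is the only kind of value inferred as a decimal partition type.
new JBigDecimal("100000000000000000000").scale  // 0
```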
LGTM

One thing we can do in a follow-up: the SQL programming guide does have a partition discovery section, but it's under the Parquet section; we should move it up a layer and put the type casting table there.
Test build #84073 has finished for PR 19389 at commit

Test build #84074 has finished for PR 19389 at commit

thanks, merging to master!
What changes were proposed in this pull request?
This PR proposes to add a rule that reuses TypeCoercion.findWiderCommonType when resolving type conflicts in partition values. Currently, this uses a numeric precedence-like comparison; therefore, it appears to introduce failures for type conflicts between timestamps, dates and decimals, please see:

The code below:

produces output as below:
Before
After
Type coercion table:
This PR proposes the type conflict resolution as below:

Before

| | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType |
|---|---|---|---|---|---|---|---|---|
| NullType | StringType | IntegerType | LongType | StringType | DoubleType | StringType | StringType | StringType |
| IntegerType | IntegerType | IntegerType | LongType | IntegerType | DoubleType | IntegerType | IntegerType | StringType |
| LongType | LongType | LongType | LongType | LongType | DoubleType | LongType | LongType | StringType |
| DecimalType(38,0) | StringType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DecimalType(38,0) | DecimalType(38,0) | StringType |
| DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | DoubleType | StringType |
| DateType | StringType | IntegerType | LongType | DateType | DoubleType | DateType | DateType | StringType |
| TimestampType | StringType | IntegerType | LongType | TimestampType | DoubleType | TimestampType | TimestampType | StringType |
| StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType |

After

| | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType |
|---|---|---|---|---|---|---|---|---|
| NullType | NullType | IntegerType | LongType | DecimalType(38,0) | DoubleType | DateType | TimestampType | StringType |
| IntegerType | IntegerType | IntegerType | LongType | DecimalType(38,0) | DoubleType | StringType | StringType | StringType |
| LongType | LongType | LongType | LongType | DecimalType(38,0) | StringType | StringType | StringType | StringType |
| DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | DecimalType(38,0) | StringType | StringType | StringType | StringType |
| DoubleType | DoubleType | DoubleType | StringType | StringType | DoubleType | StringType | StringType | StringType |
| DateType | DateType | StringType | StringType | StringType | StringType | DateType | TimestampType | StringType |
| TimestampType | TimestampType | StringType | StringType | StringType | StringType | TimestampType | TimestampType | StringType |
| StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType | StringType |

This was produced by:
How was this patch tested?
Unit tests added in ParquetPartitionDiscoverySuite.