
Conversation

@awdavidson
Contributor

As per @HyukjinKwon's request on #38312 to backport the fix into 3.2.

What changes were proposed in this pull request?

Handle `TimeUnit.NANOS` for Parquet timestamps, addressing a behaviour regression introduced in 3.2.

Why are the changes needed?

Since version 3.2, reading Parquet files that contain attributes of type `TIMESTAMP(NANOS,true)` has not been possible, because `ParquetSchemaConverter` fails with

```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
```

https://issues.apache.org/jira/browse/SPARK-34661 introduced a change that matches on the `LogicalTypeAnnotation`, which only covers the timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, so `TimeUnit.NANOS` falls through to `illegalType()`.

Prior to 3.2, the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returned `null` and therefore resulted in a `LongType`. The proposed change is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.
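For illustration, a minimal sketch of the shape of the fix. This is not the exact patch: `convertInt64` and `illegalType` are simplified stand-ins for the logic inside `ParquetToSparkSchemaConverter.convertPrimitiveField`, shown here only to make the added branch concrete.

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimestampLogicalTypeAnnotation, TimeUnit}
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.{DataType, LongType, TimestampType}

// Simplified stand-in for the converter's error helper.
def illegalType(annotation: LogicalTypeAnnotation): Nothing =
  throw new AnalysisException(s"Illegal Parquet type: INT64 ($annotation)")

// How an INT64 column's logical type annotation maps to a Spark type.
def convertInt64(annotation: LogicalTypeAnnotation): DataType = annotation match {
  // Millisecond and microsecond timestamps were already handled.
  case t: TimestampLogicalTypeAnnotation
      if t.getUnit == TimeUnit.MILLIS || t.getUnit == TimeUnit.MICROS =>
    TimestampType
  // The fix: nanosecond timestamps fall back to LongType, as they
  // effectively did before 3.2 (originalType was null for NANOS).
  case t: TimestampLogicalTypeAnnotation if t.getUnit == TimeUnit.NANOS =>
    LongType
  // Plain INT64 with no logical type annotation is a LongType.
  case null =>
    LongType
  case other =>
    illegalType(other)
}
```

Returning `LongType` mirrors the pre-3.2 `originalType` path: Spark 3.2 has no nanosecond-precision timestamp type, so surfacing the raw `INT64` as a long is the lossless option.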

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test covering this scenario.
Internally deployed to read Parquet files that contain `TIMESTAMP(NANOS,true)`.
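For context, a minimal read that exercises this path might look like the following, assuming a `spark` session (e.g. in `spark-shell`); the file path is hypothetical, standing in for a file written by another tool with a `TIMESTAMP(NANOS,true)` column.

```scala
// Hypothetical file containing a column of Parquet type TIMESTAMP(NANOS,true).
val df = spark.read.parquet("/data/events_with_nanos.parquet")
df.printSchema()
// Before the fix: org.apache.spark.sql.AnalysisException:
//   Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
// After the fix: the column is read back as bigint (raw nanosecond values).
```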

@github-actions github-actions bot added the SQL label Feb 6, 2023
@awdavidson awdavidson marked this pull request as ready for review February 6, 2023 20:14
@HyukjinKwon
Member

The test fails:

```
2023-02-06T13:38:00.5771066Z [info] - SPARK-40819: parquet file with TIMESTAMP(NANOS, true) (with default nanosAsLong=false) *** FAILED *** (54 milliseconds)
2023-02-06T13:38:00.5772303Z [info]   "Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 19) (localhost executor driver): org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
2023-02-06T13:38:00.5773363Z [info]   	at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)
2023-02-06T13:38:00.5774420Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:109)
2023-02-06T13:38:00.5841228Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:183)
2023-02-06T13:38:00.5847697Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:94)
2023-02-06T13:38:00.5848680Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:73)
2023-02-06T13:38:00.5849454Z [info]   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
2023-02-06T13:38:00.5849987Z [info]   	at scala.collection.Iterator.foreach(Iterator.scala:943)
2023-02-06T13:38:00.5850492Z [info]   	at scala.collection.Iterator.foreach$(Iterator.scala:943)
2023-02-06T13:38:00.5851023Z [info]   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
2023-02-06T13:38:00.5851562Z [info]   	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
2023-02-06T13:38:00.5852089Z [info]   	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
2023-02-06T13:38:00.5852623Z [info]   	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
2023-02-06T13:38:00.5853166Z [info]   	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
2023-02-06T13:38:00.5853955Z [info]   	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
2023-02-06T13:38:00.5854516Z [info]   	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
2023-02-06T13:38:00.5855295Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:70)
2023-02-06T13:38:00.5856239Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:67)
2023-02-06T13:38:00.5857130Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:558)
2023-02-06T13:38:00.5857754Z [info]   	at scala.Option.getOrElse(Option.scala:189)
2023-02-06T13:38:00.5858411Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:558)
2023-02-06T13:38:00.5859409Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:538)
2023-02-06T13:38:00.5860069Z [info]   	at scala.collection.immutable.Stream.map(Stream.scala:418)
2023-02-06T13:38:00.5860841Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:538)
2023-02-06T13:38:00.5861703Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:530)
2023-02-06T13:38:00.5862534Z [info]   	at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
2023-02-06T13:38:00.5863177Z [info]   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
2023-02-06T13:38:00.5863711Z [info]   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
2023-02-06T13:38:00.5864291Z [info]   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2023-02-06T13:38:00.5864877Z [info]   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
2023-02-06T13:38:00.5865393Z [info]   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
2023-02-06T13:38:00.5865932Z [info]   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
2023-02-06T13:38:00.5866461Z [info]   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
2023-02-06T13:38:00.5867017Z [info]   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
2023-02-06T13:38:00.5867565Z [info]   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
2023-02-06T13:38:00.5868095Z [info]   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
2023-02-06T13:38:00.5868696Z [info]   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2023-02-06T13:38:00.5869308Z [info]   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2023-02-06T13:38:00.5869799Z [info]   	at java.lang.Thread.run(Thread.java:750)
2023-02-06T13:38:00.5870133Z [info]
2023-02-06T13:38:00.5870657Z [info]   Driver stacktrace:" did not contain "Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))." (ParquetSchemaSuite.scala:494)
2023-02-06T13:38:00.5871221Z [info]   org.scalatest.exceptions.TestFailedException:
2023-02-06T13:38:00.5871789Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-02-06T13:38:00.5872407Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-02-06T13:38:00.5872999Z [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
2023-02-06T13:38:00.5873557Z [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
2023-02-06T13:38:00.5874263Z [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaSuite.$anonfun$new$13(ParquetSchemaSuite.scala:494)
2023-02-06T13:38:00.5874954Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2023-02-06T13:38:00.5875457Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2023-02-06T13:38:00.5875954Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2023-02-06T13:38:00.5876449Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
2023-02-06T13:38:00.5876989Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
2023-02-06T13:38:00.5877551Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
2023-02-06T13:38:00.5878120Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
2023-02-06T13:38:00.5878751Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
2023-02-06T13:38:00.5879391Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
2023-02-06T13:38:00.5879939Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
2023-02-06T13:38:00.5880500Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
2023-02-06T13:38:00.5881095Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
2023-02-06T13:38:00.5881738Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
2023-02-06T13:38:00.5882355Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
2023-02-06T13:38:00.5882924Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
```

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-40819][SQL] Backport Timestamp nanos regression to 3.2 [SPARK-40819][SQL][3.2] Timestamp nanos behaviour regression Feb 7, 2023
@HyukjinKwon
Copy link
Member

Merged to branch-3.2.

HyukjinKwon pushed a commit that referenced this pull request Feb 8, 2023
As per HyukjinKwon's request on #38312 to backport the fix into 3.2.
### What changes were proposed in this pull request?

Handle `TimeUnit.NANOS` for parquet `Timestamps` addressing a regression in behaviour since 3.2

### Why are the changes needed?

Since version 3.2, reading Parquet files that contain attributes of type `TIMESTAMP(NANOS,true)` has not been possible, because `ParquetSchemaConverter` fails with
```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
```
https://issues.apache.org/jira/browse/SPARK-34661 introduced a change that matches on the `LogicalTypeAnnotation`, which only covers the timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, so `TimeUnit.NANOS` falls through to `illegalType()`.

Prior to 3.2, the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returned `null` and therefore resulted in a `LongType`. The proposed change is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test covering this scenario.
Internally deployed to read Parquet files that contain `TIMESTAMP(NANOS,true)`.

Closes #39905 from awdavidson/ts-nanos-fix-3.2.

Authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon closed this Feb 8, 2023
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023