
Conversation

@awdavidson
Contributor

As per @HyukjinKwon's request on #38312 to backport the fix into 3.2.

What changes were proposed in this pull request?

Handle `TimeUnit.NANOS` for Parquet timestamps, addressing a behaviour regression introduced in 3.2.

Why are the changes needed?

Since version 3.2, reading Parquet files that contain attributes of type `TIMESTAMP(NANOS,true)` has not been possible, because `ParquetSchemaConverter` fails with

```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
```

https://issues.apache.org/jira/browse/SPARK-34661 introduced a change that matches on the `LogicalTypeAnnotation`, which only covers the timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, so `TimeUnit.NANOS` falls through to `illegalType()`.

Prior to 3.2, the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returned `null` and therefore resulted in a `LongType`. The proposed change is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.
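For illustration, a minimal sketch of the shape of the fix. This is not the exact patch: `convertInt64` and `illegalType` are simplified stand-ins for the logic inside `ParquetToSparkSchemaConverter.convertPrimitiveField`, shown here only to make the added branch concrete.

```scala
import org.apache.parquet.schema.LogicalTypeAnnotation
import org.apache.parquet.schema.LogicalTypeAnnotation.{TimestampLogicalTypeAnnotation, TimeUnit}
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.{DataType, LongType, TimestampType}

// Simplified stand-in for the converter's error helper.
def illegalType(annotation: LogicalTypeAnnotation): Nothing =
  throw new AnalysisException(s"Illegal Parquet type: INT64 ($annotation)")

// How an INT64 column's logical type annotation maps to a Spark type.
def convertInt64(annotation: LogicalTypeAnnotation): DataType = annotation match {
  // Millisecond and microsecond timestamps were already handled.
  case t: TimestampLogicalTypeAnnotation
      if t.getUnit == TimeUnit.MILLIS || t.getUnit == TimeUnit.MICROS =>
    TimestampType
  // The fix: nanosecond timestamps fall back to LongType, as they
  // effectively did before 3.2 (originalType was null for NANOS).
  case t: TimestampLogicalTypeAnnotation if t.getUnit == TimeUnit.NANOS =>
    LongType
  // Plain INT64 with no logical type annotation is a LongType.
  case null =>
    LongType
  case other =>
    illegalType(other)
}
```

Returning `LongType` mirrors the pre-3.2 `originalType` path: Spark 3.2 has no nanosecond-precision timestamp type, so surfacing the raw `INT64` as a long is the lossless option.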

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test covering this scenario.
Internally deployed to read Parquet files that contain `TIMESTAMP(NANOS,true)`.
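For context, a minimal read that exercises this path might look like the following, assuming a `spark` session (e.g. in `spark-shell`); the file path is hypothetical, standing in for a file written by another tool with a `TIMESTAMP(NANOS,true)` column.

```scala
// Hypothetical file containing a column of Parquet type TIMESTAMP(NANOS,true).
val df = spark.read.parquet("/data/events_with_nanos.parquet")
df.printSchema()
// Before the fix: org.apache.spark.sql.AnalysisException:
//   Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
// After the fix: the column is read back as bigint (raw nanosecond values).
```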

@github-actions github-actions bot added the SQL label Feb 6, 2023
@awdavidson awdavidson marked this pull request as ready for review February 6, 2023 20:14
@HyukjinKwon
Member

The test fails:

```
2023-02-06T13:38:00.5771066Z [info] - SPARK-40819: parquet file with TIMESTAMP(NANOS, true) (with default nanosAsLong=false) *** FAILED *** (54 milliseconds)
2023-02-06T13:38:00.5772303Z [info]   "Job aborted due to stage failure: Task 0 in stage 14.0 failed 1 times, most recent failure: Lost task 0.0 in stage 14.0 (TID 19) (localhost executor driver): org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
2023-02-06T13:38:00.5773363Z [info]   	at org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284)
2023-02-06T13:38:00.5774420Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:109)
2023-02-06T13:38:00.5841228Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:183)
2023-02-06T13:38:00.5847697Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:94)
2023-02-06T13:38:00.5848680Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:73)
2023-02-06T13:38:00.5849454Z [info]   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
2023-02-06T13:38:00.5849987Z [info]   	at scala.collection.Iterator.foreach(Iterator.scala:943)
2023-02-06T13:38:00.5850492Z [info]   	at scala.collection.Iterator.foreach$(Iterator.scala:943)
2023-02-06T13:38:00.5851023Z [info]   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
2023-02-06T13:38:00.5851562Z [info]   	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
2023-02-06T13:38:00.5852089Z [info]   	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
2023-02-06T13:38:00.5852623Z [info]   	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
2023-02-06T13:38:00.5853166Z [info]   	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
2023-02-06T13:38:00.5853955Z [info]   	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
2023-02-06T13:38:00.5854516Z [info]   	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
2023-02-06T13:38:00.5855295Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:70)
2023-02-06T13:38:00.5856239Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:67)
2023-02-06T13:38:00.5857130Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:558)
2023-02-06T13:38:00.5857754Z [info]   	at scala.Option.getOrElse(Option.scala:189)
2023-02-06T13:38:00.5858411Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:558)
2023-02-06T13:38:00.5859409Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:538)
2023-02-06T13:38:00.5860069Z [info]   	at scala.collection.immutable.Stream.map(Stream.scala:418)
2023-02-06T13:38:00.5860841Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:538)
2023-02-06T13:38:00.5861703Z [info]   	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:530)
2023-02-06T13:38:00.5862534Z [info]   	at org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76)
2023-02-06T13:38:00.5863177Z [info]   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:863)
2023-02-06T13:38:00.5863711Z [info]   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:863)
2023-02-06T13:38:00.5864291Z [info]   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
2023-02-06T13:38:00.5864877Z [info]   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
2023-02-06T13:38:00.5865393Z [info]   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
2023-02-06T13:38:00.5865932Z [info]   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
2023-02-06T13:38:00.5866461Z [info]   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
2023-02-06T13:38:00.5867017Z [info]   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
2023-02-06T13:38:00.5867565Z [info]   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
2023-02-06T13:38:00.5868095Z [info]   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
2023-02-06T13:38:00.5868696Z [info]   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2023-02-06T13:38:00.5869308Z [info]   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2023-02-06T13:38:00.5869799Z [info]   	at java.lang.Thread.run(Thread.java:750)
2023-02-06T13:38:00.5870133Z [info]
2023-02-06T13:38:00.5870657Z [info]   Driver stacktrace:" did not contain "Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))." (ParquetSchemaSuite.scala:494)
2023-02-06T13:38:00.5871221Z [info]   org.scalatest.exceptions.TestFailedException:
2023-02-06T13:38:00.5871789Z [info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-02-06T13:38:00.5872407Z [info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-02-06T13:38:00.5872999Z [info]   at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231)
2023-02-06T13:38:00.5873557Z [info]   at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295)
2023-02-06T13:38:00.5874263Z [info]   at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaSuite.$anonfun$new$13(ParquetSchemaSuite.scala:494)
2023-02-06T13:38:00.5874954Z [info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2023-02-06T13:38:00.5875457Z [info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2023-02-06T13:38:00.5875954Z [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2023-02-06T13:38:00.5876449Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
2023-02-06T13:38:00.5876989Z [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
2023-02-06T13:38:00.5877551Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
2023-02-06T13:38:00.5878120Z [info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
2023-02-06T13:38:00.5878751Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
2023-02-06T13:38:00.5879391Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
2023-02-06T13:38:00.5879939Z [info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
2023-02-06T13:38:00.5880500Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
2023-02-06T13:38:00.5881095Z [info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
2023-02-06T13:38:00.5881738Z [info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
2023-02-06T13:38:00.5882355Z [info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
2023-02-06T13:38:00.5882924Z [info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
```

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-40819][SQL] Backport Timestamp nanos regression to 3.2 [SPARK-40819][SQL][3.2] Timestamp nanos behaviour regression Feb 7, 2023
@HyukjinKwon
Copy link
Member

Merged to branch-3.2.

HyukjinKwon pushed a commit that referenced this pull request Feb 8, 2023
As per HyukjinKwon's request on #38312 to backport the fix into 3.2.
### What changes were proposed in this pull request?

Handle `TimeUnit.NANOS` for parquet `Timestamps` addressing a regression in behaviour since 3.2

### Why are the changes needed?

Since version 3.2, reading Parquet files that contain attributes of type `TIMESTAMP(NANOS,true)` has not been possible, because `ParquetSchemaConverter` fails with
```
Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (TIMESTAMP(NANOS,true))
```
https://issues.apache.org/jira/browse/SPARK-34661 introduced a change that matches on the `LogicalTypeAnnotation`, which only covers the timestamp cases for `TimeUnit.MILLIS` and `TimeUnit.MICROS`, so `TimeUnit.NANOS` falls through to `illegalType()`.

Prior to 3.2, the matching used the `originalType`, which for `TIMESTAMP(NANOS,true)` returned `null` and therefore resulted in a `LongType`. The proposed change is to consider `TimeUnit.NANOS` and return `LongType`, making the behaviour the same as before.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a unit test covering this scenario.
Internally deployed to read Parquet files that contain `TIMESTAMP(NANOS,true)`.

Closes #39905 from awdavidson/ts-nanos-fix-3.2.

Authored-by: alfreddavidson <alfie.davidson9@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@HyukjinKwon HyukjinKwon closed this Feb 8, 2023
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023