[SPARK-54609][SQL] Disable TIME type by default #53344
Conversation
Force-push: 7818f04 to 4a43700
Force-push: aa1ac43 to 5857c5d
dongjoon-hyun left a comment:
Could you make CI happy, @davidm-db ?
cc @MaxGekk, @uros-db, @cloud-fan, @HyukjinKwon, @viirya, @peter-toth, @yaooqinn, @LuciferYang, @vinodkc
sql/api/src/main/scala/org/apache/spark/sql/types/TimeType.scala (outdated review thread, resolved)
override def supportDataType(dataType: DataType): Boolean = dataType match {
  case _: TimeType => SQLConf.get.isTimeTypeEnabled
how do we block geo types for data sources?
Per offline discussion with @uros-db, blocking for Parquet should be sufficient for TIME.
Added guards for all file formats (those that weren't already explicitly marked as not supporting TIME).
Could you answer the above comments and make this PR pass the CIs for further discussion, @davidm-db?
TimeType is marked as Unstable. Is this short-term prohibition actually required?
Yes, we need this to protect the users from the accidental use of unfinished work, @yaooqinn .
Force-push: f58fbfa to 02c68f0
Force-push: 02c68f0 to c861aa6
sql/api/src/main/scala/org/apache/spark/sql/types/TimeType.scala (outdated review thread, resolved)
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (outdated review thread, resolved)
    val specialDate = convertSpecialDate(value, zoneId).map(Literal(_, DateType))
    specialDate.getOrElse(toLiteral(stringToDate, DateType))
-   case TIME => toLiteral(stringToTime, TimeType())
+   case TIME if conf.isTimeTypeEnabled => toLiteral(stringToTime, TimeType())
Since we have the check here, we don't need to update SqlBaseParser.g4 and complicate things.
I replicated what Max did internally. I think the reason for this is:
- the code you are commenting on handles literal types (statement example: `SELECT TIME'10:00:00'`) and is done this way to fit into the existing error message format
- the `{time_type_enabled}?` guard in `SqlBaseParser.g4` guards references to TIME as a type and throws a different class of errors, i.e. datatype unsupported (statement example: `CREATE TABLE t(col TIME)`)
I don't know if we want to change this behavior or not, please share your thoughts.
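For context, here is a minimal sketch of the two statement shapes being discussed. It is not code from this PR: it assumes the config key `spark.sql.timeType.enabled` from the PR description and only illustrates that each shape is rejected through a different path while the flag is off; the exact error classes are deliberately not asserted.

```scala
import org.apache.spark.sql.SparkSession

object TimeTypeErrorPathsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .config("spark.sql.timeType.enabled", "false") // assumed config key from this PR
      .getOrCreate()

    // Literal path (handled in the AstBuilder code commented on above).
    try spark.sql("SELECT TIME'10:00:00'").collect()
    catch { case e: Exception => println(s"TIME literal rejected: ${e.getClass.getSimpleName}") }

    // Type-reference path (TIME used as a column type, guarded separately).
    try spark.sql("CREATE TABLE t(col TIME) USING parquet")
    catch { case e: Exception => println(s"TIME column type rejected: ${e.getClass.getSimpleName}") }

    spark.stop()
  }
}
```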
...re/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (outdated review thread, resolved)
can we add some test cases following the geo types blocking PRs?
uros-db left a comment:
@davidm-db Should we add some tests, e.g.
- e2e sql query tests with config turned off
- blocking data sources like Parquet, CSV
- data frames (classic and spark connect)
- also, there are Scala suites for casting
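Following up on the first item in the list above, here is a rough ScalaTest-style sketch of what such an e2e test could look like. It is an illustration, not the PR's tests: the suite name is invented, the config key is the one from the PR description, and the exact error class is deliberately not asserted.

```scala
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

class TimeTypeDisabledSuite extends QueryTest with SharedSparkSession {

  test("TIME is blocked end-to-end when the flag is off") {
    withSQLConf("spark.sql.timeType.enabled" -> "false") {
      // Casting to TIME references the type, which should be rejected while it is disabled.
      val e = intercept[Exception] {
        spark.sql("SELECT CAST('10:00:00' AS TIME)").collect()
      }
      assert(e.getMessage != null)
    }
  }
}
```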
BTW, given @yaooqinn's comment, while working on this PR we need to build a consensus on this by sending an email to the dev@spark mailing list, @davidm-db, @uros-db, and @cloud-fan.
It would be enough to reply to the RC2 email about the TIME type. Maybe you could send out the decision clearly to the mailing list, please, @cloud-fan, because @MaxGekk is not in the loop yet?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala (outdated review thread, resolved)
  unsupportedType = ctx.literalType.getText,
  supportedTypes =
    // TODO: Remove TIME from the list.
    Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X", "TIME"),
Maybe, the following style?
- Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X", "TIME"),
+ Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X") ++ (if (conf.isTimeTypeEnabled) Seq("TIME") else None)
Yeah, will do this definitely. There are a lot of dependencies and TIME needs to be guarded in a lot of places, so for now I'm just trying to figure out what's needed to make the CI pass. Afterwards, I'll sort out the TODO comment. Hope to finish everything tomorrow!
Done.
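For reference, a quick standalone check of the suggested style (a sketch, with `timeEnabled` standing in for `conf.isTimeTypeEnabled`): in Scala 2.13 an `Option` is an `IterableOnce`, so `None` contributes nothing under `++` and `TIME` only appears when the flag is on.

```scala
object SupportedTypesCheck {
  def main(args: Array[String]): Unit = {
    val base = Seq("DATE", "TIMESTAMP_NTZ", "TIMESTAMP_LTZ", "TIMESTAMP", "INTERVAL", "X")

    // Mirrors the suggested expression above.
    def supportedTypes(timeEnabled: Boolean): Seq[String] =
      base ++ (if (timeEnabled) Seq("TIME") else None)

    println(supportedTypes(timeEnabled = false)) // List(DATE, ..., X)
    println(supportedTypes(timeEnabled = true))  // List(DATE, ..., X, TIME)
  }
}
```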
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala (outdated review thread, resolved)
Force-push: f2552ce to c9cf94c
dongjoon-hyun left a comment:
Is this still missing, @davidm-db and @cloud-fan ?
Need to add tests for disabled config.
I verified manually that it's blocked properly. Given that, shall we proceed with those test cases as a follow-up, @davidm-db and @cloud-fan?
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (outdated review thread, resolved)
I removed all client-side checks as they're not very meaningful. People can use the …
So, is this the final status from your side, @cloud-fan? Then, could you give your approval?
    new GeometryConverter(g)
  case DateType if SQLConf.get.datetimeJava8ApiEnabled => LocalDateConverter
  case DateType => DateConverter
  case _: TimeType if !SQLConf.get.isTimeTypeEnabled =>
Just for the record, we don't have this for GeographyType|GeometryType.
We have it here, a few lines above:

case _ @ (_: GeographyType | _: GeometryType) if !SQLConf.get.geospatialEnabled =>
  throw new org.apache.spark.sql.AnalysisException(
    errorClass = "UNSUPPORTED_FEATURE.GEOSPATIAL_DISABLED",
    messageParameters = scala.collection.immutable.Map.empty)
def verifySchema(format: FileFormat, schema: StructType, readOnly: Boolean = false): Unit = {
  if (!SQLConf.get.isTimeTypeEnabled && schema.existsRecursively(_.isInstanceOf[TimeType])) {
    throw QueryCompilationErrors.unsupportedTimeTypeError()
  }
Ditto. We don't have this for Geo*Type.
no data source supports geo types yet, so it's not needed for now. But to be future-proof we should check geo here as well.
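As a side note on why `verifySchema` needs a recursive check, here is a small sketch of my own (not code from the PR): a TIME field can hide inside nested structs, arrays, or maps, so a top-level scan would miss it. The helper below mirrors what `schema.existsRecursively(_.isInstanceOf[TimeType])` does.

```scala
import org.apache.spark.sql.types._

object NestedTimeCheck {
  // Walks a DataType tree looking for TIME anywhere inside it.
  def containsTime(dt: DataType): Boolean = dt match {
    case _: TimeType   => true
    case s: StructType => s.fields.exists(f => containsTime(f.dataType))
    case a: ArrayType  => containsTime(a.elementType)
    case m: MapType    => containsTime(m.keyType) || containsTime(m.valueType)
    case _             => false
  }

  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("event", StructType(Seq(
        StructField("start", TimeType())
      )))
    ))
    println(containsTime(schema)) // true, even though TIME only appears in a nested struct
  }
}
```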
| """ | ||
| print("Enabling TIME data type") | ||
| jspark.sql("SET spark.sql.timeType.enabled = true") |
Do we have this for Geo*Type?
I'm also curious about why geo didn't fail here...
dongjoon-hyun left a comment:
Got it.
+1, LGTM. Thank you so much, @cloud-fan .
I manually verified the compilation and the Scala linter, and both …
Merged to master.
Never mind. I resolved the conflicts and am testing locally on …
Introducing a new SQL config for TIME type: `spark.sql.timeType.enabled`. The default value is `false` and it is enabled only in tests. TIME data type support is not complete, so we need to guard it before it is completed, especially ahead of Spark 4.1 release. No. Need to add tests for disabled config. No.

Closes #53344 from davidm-db/davidm-db/time-config.

Lead-authored-by: David Milicevic <david.milicevic@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 18a9435)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Merged to branch-4.1, too.
…ark 4.1 (#11313)
* [Fix] Remove `@NotNull` annotations to resolve dependency issues caused by ORC upgrade. See apache/spark#51676
* [Fix] Make `reserveNewColumn` public in `ArrowWritableColumnVector` and `WritableColumnVectorShim`. In Java, an overriding method's access modifier cannot be more restrictive than the overridden method. Changing from protected to public is safe and ensures compatibility before the Spark version upgrade. See apache/spark#52557
* [Fix] Add GeographyVal and GeometryVal support in ArrowColumnarRow, BatchCarrierRow and ColumnarToCarrierRowExecBase. See [SPIP: Add geospatial types in Spark](https://issues.apache.org/jira/browse/SPARK-51658)
* [Fix] Update commons-collections to version 4.5.0. See apache/spark#52743
* [Fix] Enable SPARK_TESTING environment variable for Spark test jobs. See apache/spark#53344
What changes were proposed in this pull request?
Introducing a new SQL config for TIME type: `spark.sql.timeType.enabled`. The default value is `false` and it is enabled only in tests.

Why are the changes needed?

TIME data type support is not complete, so we need to guard it before it is completed, especially ahead of Spark 4.1 release.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Need to add tests for disabled config.
Was this patch authored or co-authored using generative AI tooling?
No.
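For readers unfamiliar with how such a flag is wired up, here is a rough sketch of the usual SQLConf entry pattern; the doc text, version, and accessor shown are assumptions for illustration, the fragment belongs inside `org.apache.spark.sql.internal.SQLConf`, and it may differ from what the PR actually adds.

```scala
// Sketch only: placed inside object SQLConf; doc text and version are assumed.
val TIME_TYPE_ENABLED = buildConf("spark.sql.timeType.enabled")
  .internal()
  .doc("When true, the TIME data type is enabled. It is disabled by default because " +
    "support for the type is not yet complete.")
  .version("4.1.0")
  .booleanConf
  .createWithDefault(false)

// ...and inside class SQLConf, an accessor like the one the diffs above call:
def isTimeTypeEnabled: Boolean = getConf(TIME_TYPE_ENABLED)
```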