@robinyqiu
Contributor

@robinyqiu robinyqiu commented Mar 31, 2020

This PR adds support for all ZetaSQL (BigQuery Standard SQL) DATE functions to BeamSQL:

  • CURRENT_DATE
  • EXTRACT
  • DATE (constructing DATE from DATETIME not supported)
  • DATE_ADD
  • DATE_SUB
  • DATE_DIFF
  • DATE_TRUNC
  • FORMAT_DATE
  • PARSE_DATE
  • UNIX_DATE
  • DATE_FROM_UNIX_DATE
  • Note: the WEEK date part is not supported

r: @apilloud
cc: @TheNeuralBit @kennknowles


@robinyqiu robinyqiu changed the title from "Support ZetaSQL DATE type as a Beam LogicalType" to "[BEAM-9641] Support ZetaSQL DATE type as a Beam LogicalType" on Mar 31, 2020
@robinyqiu robinyqiu force-pushed the date-and-time branch 4 times, most recently from 3003dbb to cca562d on April 8, 2020 05:16
@TheNeuralBit
Member

What do you think about going ahead and defining the date logical type in org.apache.beam.sdk.schemas.logicaltypes? It would be useful in other contexts - for example it would give us something to map Avro's logical date type to (currently it is just overloaded with millis-instant onto DATETIME)

cc: @reuvenlax

@robinyqiu
Contributor Author

> What do you think about going ahead and defining the date logical type in org.apache.beam.sdk.schemas.logicaltypes? It would be useful in other contexts - for example it would give us something to map Avro's logical date type to (currently it is just overloaded with millis-instant onto DATETIME)

Done. Thanks for the suggestion. I made the Date type a public logical type in org.apache.beam.sdk.schemas.logicaltypes and added a layer of indirection by letting SqlTypes.DATE reference it.

Member

@apilloud apilloud left a comment

LGTM

Member

nit: Can you clean up the style and call equals on the constant instead of on logicalId (which could be null)?
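For readers following along, the difference between the two call orders can be sketched in plain Java. This is only an illustration: the class name, the `DATE_IDENTIFIER` constant, and its value are hypothetical and are not taken from the PR.

```java
final class LogicalIdCheck {
  // Hypothetical identifier, used only for illustration.
  static final String DATE_IDENTIFIER = "example:date";

  // Unsafe order: throws NullPointerException when logicalId is null.
  static boolean isDateUnsafe(String logicalId) {
    return logicalId.equals(DATE_IDENTIFIER);
  }

  // Safe order: the constant is never null, so a null logicalId yields false.
  static boolean isDate(String logicalId) {
    return DATE_IDENTIFIER.equals(logicalId);
  }

  public static void main(String[] args) {
    System.out.println(isDate("example:date")); // true
    System.out.println(isDate(null));           // false
    try {
      isDateUnsafe(null);
    } catch (NullPointerException e) {
      System.out.println("logicalId.equals(...) threw NullPointerException");
    }
  }
}
```

Calling `equals` on the constant folds the null check into the comparison, so no separate guard is needed.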

Member

What about using a switch statement? Is there any style guidance on using switch on a String in Java?

Contributor Author

Done. I wish I could use a switch statement here, but unfortunately there is no constant IDENTIFIER defined in the LogicalType class. (I could add it to each concrete SQL logical type I create, but I don't think that is good style.)
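The constraint mentioned here comes from the Java language itself: `case` labels in a switch on a String must be compile-time constant expressions, so a value that is only available through a method call cannot be used as a label. A minimal sketch of what such a switch could look like if constants existed, with the class name and both identifiers being hypothetical:

```java
final class TypeNameSwitch {
  // Hypothetical identifiers. Each is a static final String initialized
  // with a literal, which makes it usable as a case label.
  static final String DATE_ID = "example:date";
  static final String TIME_ID = "example:time";

  static String describe(String logicalId) {
    // A switch on a null String throws NullPointerException, so guard first.
    if (logicalId == null) {
      return "unknown";
    }
    switch (logicalId) {
      case DATE_ID:
        return "date";
      case TIME_ID:
        return "time";
      default:
        return "unknown";
    }
  }

  public static void main(String[] args) {
    System.out.println(describe("example:date")); // date
    System.out.println(describe("example:time")); // time
    System.out.println(describe(null));           // unknown
  }
}
```

Note that the switch needs an explicit null guard, whereas the constant-first `equals` chain handles null without one.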

Member

nit: Call equals on constant to avoid null issues.

Contributor Author

Done.

Member

nit: This order of equals is awesome!

Member

I think it is worth documenting that the Long is an offset from an epoch (and what that epoch is).

Contributor Author

Done.

Member

If I'm reading this correctly, LocalDate is the in-memory type (a struct) and Long is the wire format (an offset from epoch)? This conversion could be quite expensive. It appears the Calc nodes both take an offset in this case, so when we start to think about performance we might need to change the in-memory type to be offset based.
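The conversion under discussion can be sketched with the standard `java.time` API. This is a standalone illustration rather than the code in the PR: it assumes the Long counts days since 1970-01-01, and the class name is hypothetical (the method names only echo the input-type and base-type vocabulary used in this thread).

```java
import java.time.LocalDate;

final class EpochDayDate {
  // Assumption: the wire format is the number of days since 1970-01-01 (day 0).
  static long toBaseType(LocalDate input) {
    return input.toEpochDay();
  }

  // Rebuilds the in-memory LocalDate from the day offset.
  static LocalDate toInputType(long base) {
    return LocalDate.ofEpochDay(base);
  }

  public static void main(String[] args) {
    LocalDate date = LocalDate.of(2020, 3, 31);
    long days = toBaseType(date);
    System.out.println(days);                           // 18352
    System.out.println(toInputType(days).equals(date)); // true
  }
}
```

Each direction costs a calendar computation per value, which is the overhead being weighed here against the readability of LocalDate.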

Member

(If changing the in memory type is going to be difficult in the future, consider doing that now.)

Member

That is unfortunate... but what in-memory type should we use instead? joda.time.LocalDate uses a millisecond long, do we want to add another joda dependency?

We could access the base type (wire format type) directly in SQL with Row#getBaseValue, but unfortunately Rows store logical types as the input type (in memory format type), so that wouldn't actually avoid a conversion.

Member

I guess java.sql.Date is another option for a java type backed by millis.

Member

We should consider not using a JVM, it adds performance overhead too. 🤓

I'm reasonably convinced the wire format is good and the conversion here is lossless, so if there isn't an easy drop-in replacement, leave this as is.

Contributor Author

I think Andrew is basically suggesting using a PassThroughLogicalType<Long> as a logical type for DATE. I think we could definitely consider this if performance becomes a problem in the future. (It's not easy to change the in-memory type for Date after it is made public, but we can easily define a new SqlDate.) For now I think we can leave it as is. It's more human readable (e.g. writing tests for DATE type in spec tests is simpler).

@apilloud
Member

Oops, forgot to include in my comments: ZetaSQL's range is much smaller than the underlying type, can you add a test or two for that? How do out of range values fail? (Also worth asking, do we need any special treatment for boundary conditions (LocalDate.MIN, LocalDate.MAX)? Probably not for now.)
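As a rough illustration of the kind of range check such a test might exercise, here is a sketch that assumes the DATE range documented for BigQuery Standard SQL, 0001-01-01 through 9999-12-31. The class and method names are hypothetical, and throwing IllegalArgumentException is only one possible failure mode, since how out-of-range values should fail is exactly the open question.

```java
import java.time.LocalDate;

final class DateRangeCheck {
  // Assumed bounds, taken from the documented BigQuery DATE range.
  static final LocalDate MIN_DATE = LocalDate.of(1, 1, 1);
  static final LocalDate MAX_DATE = LocalDate.of(9999, 12, 31);

  // True when the date lies within the assumed bounds (inclusive).
  static boolean inRange(LocalDate date) {
    return !date.isBefore(MIN_DATE) && !date.isAfter(MAX_DATE);
  }

  // Converts a day offset to a LocalDate and rejects out-of-range values.
  // LocalDate.ofEpochDay itself throws DateTimeException for offsets that
  // exceed LocalDate's own, much wider, range.
  static LocalDate checkedFromEpochDay(long epochDay) {
    LocalDate date = LocalDate.ofEpochDay(epochDay);
    if (!inRange(date)) {
      throw new IllegalArgumentException("DATE out of range: " + date);
    }
    return date;
  }

  public static void main(String[] args) {
    System.out.println(inRange(LocalDate.of(2020, 3, 31))); // true
    System.out.println(inRange(LocalDate.MIN));             // false
    System.out.println(inRange(LocalDate.MAX));             // false
  }
}
```

Both LocalDate.MIN and LocalDate.MAX fall far outside the assumed bounds, so a check like this would reject them without special handling.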

@robinyqiu
Contributor Author

Ah, just realized that the previous comments were not sent out.

@robinyqiu
Contributor Author

Could you help trigger the tests again?

For the comment on range: Thanks for pointing it out. I overlooked this problem. I would like to create a separate PR to address it, along with range testing for other types as well.

@apilloud
Member

apilloud commented May 6, 2020

retest this please

@robinyqiu
Contributor Author

The failing test SparkPortableExecutionTest.testExecution should be unrelated to this change.

@apilloud
Member

apilloud commented May 7, 2020

Run Java PreCommit

@apilloud
Member

apilloud commented May 7, 2020

Run SQL Postcommit

@robinyqiu
Contributor Author

Rebased against master. Please run precommit tests again.

@apilloud
Member

retest this please

@apilloud
Member

retest this please

@robinyqiu
Contributor Author

Java PreCommit failed due to a build failure. Please help run again.

@TheNeuralBit
Member

Run Java PreCommit

@TheNeuralBit
Member

Run Java PreCommit

@TheNeuralBit
Member

Run Java PreCommit

@apilloud apilloud merged commit 47c246b into apache:master May 18, 2020
@TheNeuralBit
Member

TheNeuralBit commented May 21, 2020

Something just occurred to me - are there any tests that use the DATE Type in an aggregation (e.g. MAX)?

I'd think that would run into the same issue I have in #11456 (processing logical types using their representation)

@apilloud
Member

Interesting question. You should probably add a test for JOIN as well, which will have a similar class of problems.

@robinyqiu
Contributor Author

> are there any tests that use the DATE Type in an aggregation (e.g. MAX)?

No. Thanks for bringing this up. I think it is likely to run into the problem.

@robinyqiu robinyqiu deleted the date-and-time branch June 11, 2020 23:33