Use writer schema only for BigQueryIO Read API #23594
Conversation
Assigning reviewers. If you would like to opt out of this review, comment R: @kileys for label java. Available commands:

The PR bot will only process comments in the main thread (not review comments).

Run PostCommit_Java_DataflowV2

Run PostCommit_Java_Dataflow
Can we get unit test coverage on this case as well? I still don't completely understand how this fixed it (or how it broke in the first place).

edit: ok, I see what the problem was now. So in the past we were never setting a schema on the AvroSource, causing it to use the writer schema for both. Should we instead not specify the reader schema at all on the AvroSource (i.e. revert it back to how it was)? Given that it seems like the schema created by

edit 2: It doesn't seem like there's much value in passing a reader schema at all (to AvroSource), since it's always derived from the table schema that we're reading, so any "evolution" happening between the writer and reader would really be unintentional.
Yeah, your understanding is correct. The reader schema is mostly passed to satisfy AvroSource validation, which expects users to pass readerSchema at compile time, since it forms a coder based on that. We could either update the AvroSource validation logic and/or update the
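The writer/reader distinction discussed above can be sketched as follows. This is a conceptual illustration of Avro-style schema resolution, not the actual Avro or Beam API; the `resolve` helper and the schema representation are hypothetical. The point is that decoding with the writer schema on both sides always succeeds, while a reader schema derived through a buggy conversion can fail to resolve.

```python
# Conceptual sketch (hypothetical helper, not the real Avro API): the writer
# schema describes the bytes on the wire; the reader schema describes what
# the consumer wants. Identical schemas resolve trivially; a mangled reader
# schema makes resolution fail.

def resolve(writer_schema, reader_schema, record):
    """Decode `record` (written with writer_schema) against reader_schema."""
    out = {}
    for field, writer_type in writer_schema.items():
        reader_type = reader_schema.get(field, writer_type)
        if reader_type != writer_type:
            # Real Avro allows certain promotions (int -> long, etc.); this
            # sketch treats any mismatch as an error, which is analogous to
            # what happens when a logical type is converted incorrectly.
            raise ValueError(f"cannot resolve {field}: "
                             f"writer={writer_type} reader={reader_type}")
        out[field] = record[field]
    return out

writer = {"id": "long", "ts": "long(timestamp-micros)"}
row = {"id": 1, "ts": 1_665_000_000_000_000}

# Using the writer schema for both sides (the behavior this PR restores)
# always succeeds:
assert resolve(writer, writer, row) == row

# A reader schema whose derivation dropped the logical type no longer
# matches the writer schema, so resolution fails:
bad_reader = {"id": "long", "ts": "long"}
try:
    resolve(writer, bad_reader, row)
except ValueError as e:
    print(e)
```

This is why passing the writer schema as the reader schema (or skipping the reader schema entirely) sidesteps the problem: no resolution between mismatched schemas ever happens.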
Ah, I see: before, we were passing a parseFn, so it was bypassing the validation that we needed to have a schema. It seems like that coder from AvroSource is never actually used in our case, but I understand why the validation would be there in general. Fixing
sounds good, let me update the toGenericAvroSchema logic and try adding a unit test for the same. |
@ahmedabu98 thoughts on landing this now to get things back to working, then follow up with fixing the root cause later? |
@steveniemitz SGTM |
Cool, the failing postcommit tests are unrelated to this now.
This PR fixes #23541.

PR #22718 updated the behavior of the BigQueryIO read and readTableRows APIs so that they use both the Avro writer and reader schemas to decode BigQuery elements. Before that change, these APIs relied only on the writer schema to deserialize elements. The issue with using a reader schema is that it is derived via the BigQueryAvroUtils.toGenericAvroSchema function, which does not correctly handle the TableFieldSchema -> Avro schema conversion for Avro schemas containing logical types.

R: @steveniemitz
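The conversion problem described above can be illustrated with a small sketch. This is a hypothetical mapping, not Beam's actual toGenericAvroSchema code (which is Java); the `field_to_avro` helper and the exact mapping table are assumptions. It shows the key requirement: a BigQuery TIMESTAMP must map to an Avro long carrying the timestamp-micros logical type, because dropping the logical type yields a reader schema that no longer matches what was written.

```python
# Hypothetical sketch of a TableFieldSchema -> Avro field mapping (names and
# mappings are illustrative, not Beam's actual implementation). The crucial
# detail is that logical types must be carried through the conversion.

def field_to_avro(bq_type):
    mapping = {
        "INTEGER": {"type": "long"},
        "FLOAT": {"type": "double"},
        "STRING": {"type": "string"},
        # Correct handling: keep the logical-type annotation. A conversion
        # that emits a plain "long" here produces a reader schema that
        # fails to resolve against the writer schema.
        "TIMESTAMP": {"type": "long", "logicalType": "timestamp-micros"},
    }
    return mapping[bq_type]

assert field_to_avro("TIMESTAMP") == {"type": "long",
                                      "logicalType": "timestamp-micros"}
```

Until the conversion handles every logical type correctly, using only the writer schema for decoding avoids the mismatch entirely.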
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
- Choose reviewer(s) and mention them in a comment (R: @username).
- Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #&lt;ISSUE NUMBER&gt; instead.
- Update CHANGES.md with noteworthy changes.

See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.