Skip to content

Conversation

@kkdoon
Copy link
Contributor

@kkdoon kkdoon commented Oct 12, 2022

This PR fixes #23541

PR 22718 updated the behavior of BigQueryIO read and readTableRows API where it uses both the Avro writer and reader schema to decode BigQuery elements. However, before this change, these APIs only relied on the writer schema to deserialize elements. The issue with using reader schema is that its derived via the BigQueryAvroUtils.toGenericAvroSchema function that does not correctly handle TableFieldSchema -> Avro schema conversion for Avro schemas containing LogicalTypes.

R: @steveniemitz


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI.

@github-actions
Copy link
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kileys for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@ahmedabu98
Copy link
Contributor

Run PostCommit_Java_DataflowV2

@ahmedabu98
Copy link
Contributor

Run PostCommit_Java_Dataflow

@steveniemitz
Copy link
Contributor

steveniemitz commented Oct 12, 2022

Can we get unit test coverage on this case as well? I still don't completely understand how this fixed it (or how it broke in the first place).

edit: ok, I see what the problem was now. So in the past we were never setting a schema on the AvroSource, causing it to use the writer schema for both. Should we instead not specify the reader schema at all on the AvroSource (ie revert it back to how it was)? Given that it seems like the schema created by BigQueryAvroUtils.toGenericAvroSchema is "wrong"?

edit 2: It doesn't seem like there's much value in passing a reader schema at all (to AvroSource), since its always derived from the table schema that we're reading, so any "evolution" happening between the writer and reader would really be unintentional.

@kkdoon
Copy link
Contributor Author

kkdoon commented Oct 12, 2022

so any "evolution" happening between the writer and reader would really be unintentional.

yeah your understanding is correct. The reader schema is mostly passed to satisfy AvroSource validation, which expects users to pass readerSchema at compile time, since it forms a coder based on that. We could either update the AvroSource validation logic and/or update the BigQueryAvroUtils.toGenericAvroSchema logic to correctly generate schema for numeric types.

@steveniemitz
Copy link
Contributor

ah I see, before we were passing a parseFn so it was bypassing the validation that we needed to have a schema. It seems like that coder from AvroSource is never actually used in our case, but I understand why the validation would be there in general.

Fixing toGenericAvroSchema seems like a good solution no matter what, but for now what you have here I think makes a lot of sense as a work around until we can fix the root cause.

@kkdoon
Copy link
Contributor Author

kkdoon commented Oct 12, 2022

ah I see, before we were passing a parseFn so it was bypassing the validation that we needed to have a schema. It seems like that coder from AvroSource is never actually used in our case, but I understand why the validation would be there in general.

Fixing toGenericAvroSchema seems like a good solution no matter what, but for now what you have here I think makes a lot of sense as a work around until we can fix the root cause.

sounds good, let me update the toGenericAvroSchema logic and try adding a unit test for the same.

@steveniemitz
Copy link
Contributor

@ahmedabu98 thoughts on landing this now to get things back to working, then follow up with fixing the root cause later?

@ahmedabu98
Copy link
Contributor

@steveniemitz SGTM

@steveniemitz
Copy link
Contributor

cool, the failing postcommit tests are unrelated to this now.

@steveniemitz steveniemitz self-assigned this Oct 12, 2022
@steveniemitz steveniemitz merged commit 8e52d87 into apache:master Oct 12, 2022
@steveniemitz steveniemitz deleted the bqio-parsefn-schema-bug branch October 12, 2022 20:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BigQuerySamplesIT.testTableIO is breaking in Java PostCommits

3 participants