Use writer schema only for BigQueryIO Read API #23594

kkdoon · 2022-10-12T05:31:56Z

This PR fixes #23541

PR 22718 updated the behavior of BigQueryIO read and readTableRows API where it uses both the Avro writer and reader schema to decode BigQuery elements. However, before this change, these APIs only relied on the writer schema to deserialize elements. The issue with using reader schema is that its derived via the BigQueryAvroUtils.toGenericAvroSchema function that does not correctly handle TableFieldSchema -> Avro schema conversion for Avro schemas containing LogicalTypes.

R: @steveniemitz

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

Choose reviewer(s) and mention them in a comment (R: @username).
Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
Update CHANGES.md with noteworthy changes.
If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

See CI.md for more information about GitHub Actions CI.

github-actions · 2022-10-12T06:13:15Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kileys for label java.
R: @Abacn for label io.

Available commands:

stop reviewer notifications - opt out of the automated review tooling
remind me after tests pass - tag the comment author after tests pass
waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

ahmedabu98 · 2022-10-12T13:23:27Z

Run PostCommit_Java_DataflowV2

ahmedabu98 · 2022-10-12T13:23:36Z

Run PostCommit_Java_Dataflow

steveniemitz · 2022-10-12T13:38:17Z

Can we get unit test coverage on this case as well? I still don't completely understand how this fixed it (or how it broke in the first place).

edit: ok, I see what the problem was now. So in the past we were never setting a schema on the AvroSource, causing it to use the writer schema for both. Should we instead not specify the reader schema at all on the AvroSource (ie revert it back to how it was)? Given that it seems like the schema created by BigQueryAvroUtils.toGenericAvroSchema is "wrong"?

edit 2: It doesn't seem like there's much value in passing a reader schema at all (to AvroSource), since its always derived from the table schema that we're reading, so any "evolution" happening between the writer and reader would really be unintentional.

kkdoon · 2022-10-12T18:40:35Z

so any "evolution" happening between the writer and reader would really be unintentional.

yeah your understanding is correct. The reader schema is mostly passed to satisfy AvroSource validation, which expects users to pass readerSchema at compile time, since it forms a coder based on that. We could either update the AvroSource validation logic and/or update the BigQueryAvroUtils.toGenericAvroSchema logic to correctly generate schema for numeric types.

steveniemitz · 2022-10-12T18:50:44Z

ah I see, before we were passing a parseFn so it was bypassing the validation that we needed to have a schema. It seems like that coder from AvroSource is never actually used in our case, but I understand why the validation would be there in general.

Fixing toGenericAvroSchema seems like a good solution no matter what, but for now what you have here I think makes a lot of sense as a work around until we can fix the root cause.

kkdoon · 2022-10-12T19:04:42Z

ah I see, before we were passing a parseFn so it was bypassing the validation that we needed to have a schema. It seems like that coder from AvroSource is never actually used in our case, but I understand why the validation would be there in general.

Fixing toGenericAvroSchema seems like a good solution no matter what, but for now what you have here I think makes a lot of sense as a work around until we can fix the root cause.

sounds good, let me update the toGenericAvroSchema logic and try adding a unit test for the same.

steveniemitz · 2022-10-12T19:06:02Z

@ahmedabu98 thoughts on landing this now to get things back to working, then follow up with fixing the root cause later?

ahmedabu98 · 2022-10-12T20:46:06Z

@steveniemitz SGTM

steveniemitz · 2022-10-12T20:47:08Z

cool, the failing postcommit tests are unrelated to this now.

use write schema only for read api

eef7ce2

github-actions bot added gcp io java labels Oct 12, 2022

kkdoon mentioned this pull request Oct 12, 2022

BigQuerySamplesIT.testTableIO is breaking in Java PostCommits #23541

Closed

github-actions bot added the Next Action: Reviewers label Oct 12, 2022

steveniemitz self-assigned this Oct 12, 2022

steveniemitz approved these changes Oct 12, 2022

View reviewed changes

steveniemitz merged commit 8e52d87 into apache:master Oct 12, 2022

steveniemitz deleted the bqio-parsefn-schema-bug branch October 12, 2022 20:52

kennknowles mentioned this pull request May 3, 2023

[Bug]: BigQuerySourceBase does not propagate a Coder to AvroSource #26329

Open

15 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use writer schema only for BigQueryIO Read API #23594

Use writer schema only for BigQueryIO Read API #23594

Uh oh!

kkdoon commented Oct 12, 2022 •

edited

Loading

Uh oh!

github-actions bot commented Oct 12, 2022

Uh oh!

ahmedabu98 commented Oct 12, 2022

Uh oh!

ahmedabu98 commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022 •

edited

Loading

Uh oh!

kkdoon commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022

Uh oh!

kkdoon commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022

Uh oh!

ahmedabu98 commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use writer schema only for BigQueryIO Read API #23594

Use writer schema only for BigQueryIO Read API #23594

Uh oh!

Conversation

kkdoon commented Oct 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

GitHub Actions Tests Status (on master branch)

Uh oh!

github-actions bot commented Oct 12, 2022

Uh oh!

ahmedabu98 commented Oct 12, 2022

Uh oh!

ahmedabu98 commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kkdoon commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022

Uh oh!

kkdoon commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022

Uh oh!

ahmedabu98 commented Oct 12, 2022

Uh oh!

steveniemitz commented Oct 12, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

kkdoon commented Oct 12, 2022 •

edited

Loading

steveniemitz commented Oct 12, 2022 •

edited

Loading