-
Notifications
You must be signed in to change notification settings - Fork 4.5k
Support custom avro DatumReader when reading from BigQuery #22718
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Assigning reviewers. If you would like to opt out of this review, comment R: @kennknowles for label java. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
|
can we plumb this down into BigQueryStorageAvroReader as well so it'll work with direct reads? |
| * <p> This API allows direct deserialization of AVRO data to the target class. | ||
| */ | ||
| public static <T> TypedRead<T> readWithDatumReader( | ||
| AvroSource.DatumReaderFactory<T> factory, org.apache.avro.Schema readerSchema) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we make the reader schema optional, and use the writer schema if its not set? Thats how AvroIO works IIRC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like AvroIO derives it in case the class is of SpecificData subType. We could do the same but we would need to take the class type as input instead (don't think its cleaner though).
Also, readerSchema is needed by AvroSource in general, as it does validation based on that, if parserFn is not set,
as well as derives the AvroCoder based on the readerSchema. Since it gets the writer schema during runtime from the file metadata, we cannot assign reader schema as writer schema on the submitter side.
yeah, i was thinking of tackling that in another review. I could add in this one too to allow this for different read types. |
that's cool, you can do it in another review, just curious. |
|
Reminder, please take a look at this pr: @kennknowles @chamikaramj |
|
R: @chamikaramj |
|
Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control |
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
|
Run Java PreCommit |
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Show resolved
Hide resolved
...a/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java
Outdated
Show resolved
Hide resolved
|
Run Java PreCommit |
|
Run Java_Examples_Dataflow PreCommit |
|
Run Java_Examples_Dataflow_Java11 PreCommit |
|
Run Java_Examples_Dataflow_Java17 PreCommit |
|
Run Java PreCommit |
|
Run Java PreCommit |
|
I ran and profiled a few dataflow jobs with and without my changes to test for performance regression. Overall the profiling data looks good for these new changes. |
|
I believe this PR broke Java Dataflow PostCommits: https://ci-beam.apache.org/job/beam_PostCommit_Java_DataflowV1/2173/testReport/. See #23541 |
Similar to 11479 this PR adds functionality to directly deserialize Avro records to the target class, via the user specified DatumReader, in BigQueryIO.
R: @pabloem @chamikaramj @steveniemitz
This fixes #22717
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.