Conversation

@Abacn Abacn commented Feb 9, 2023

Fixes #25355

  • Handle dynamic table destinations in UpdateSchemaDestination impl

  • Add ZERO_LOAD job type for schema update load

  • Fix BigQuerySchemaUpdateOptionsIT to actually test the temp tables scenario

Rewrite BigQuerySchemaUpdateOptionsIT.runWriteTestTempTable to test the dynamicDestination scenario


Abacn commented Feb 9, 2023

The current UpdateSchemaDestination implementation in Java does not consider dynamic destinations; it simply takes the first destination it sees. This causes problems in many ways. The integration test that is supposed to cover this class did not work either: it should set .withMaxFileSize() instead of .withMaxBytesPerPartition() to produce multiple files and therefore multiple partitions. Otherwise only a single file was written and the MultiPartitions branch of the pipeline was never exercised.
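
A minimal sketch, under stated assumptions, of the write configuration such a test needs: the input PCollection rows, the updated schema newSchema, the table names, and the class/method names are hypothetical placeholders, and the exact sizes the real integration test uses may differ.

import java.util.EnumSet;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.SchemaUpdateOption;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.PCollection;

class SchemaUpdateWriteSketch {
  static void writeWithDynamicDestinations(PCollection<TableRow> rows, TableSchema newSchema) {
    rows.apply(
        "WriteWithDynamicDestinations",
        BigQueryIO.writeTableRows()
            // Route rows to more than one table so the schema-update step must
            // handle several destinations, not just the first one it sees.
            .to(row -> new TableDestination(
                "my-project:my_dataset.table_" + row.getValue().get("branch"), null))
            .withSchema(newSchema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withSchemaUpdateOptions(EnumSet.of(SchemaUpdateOption.ALLOW_FIELD_ADDITION))
            // Per the comment above: a small max file size yields multiple files
            // and hence multiple partitions, exercising the temp-table branch.
            .withMaxFileSize(100)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}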

for (KV<DestinationT, WriteTables.Result> entry : element) {
  destination = entry.getKey();
  if (destination != null) {

Contributor Author:

I neglected the handling of a null destination in the original implementation. I do not see what kind of scenario would produce a null destination (unless a user-provided DynamicDestinations.getDestination returns null, but the documentation says it may not return null). Nevertheless, the upstream WriteTempTables essentially has its processElement call dynamicDestinations.getTable(destination) for all incoming elements:

TableDestination tableDestination = dynamicDestinations.getTable(destination);

and requires it to return a non-null tableDestination (the same call is used in UpdateSchemaDestination). So the behavior here is made consistent with WriteTables.
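
As a hedged illustration of that consistency, a small sketch of the per-element handling; the class and method names are made up, and ResultT stands in for the internal WriteTables.Result value type:

import java.util.Objects;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.KV;

class UpdateSchemaDestinationSketch {
  // Resolve every destination seen in the element rather than only the first one.
  static <ElementT, DestinationT, ResultT> void resolveAllDestinations(
      Iterable<KV<DestinationT, ResultT>> element,
      DynamicDestinations<ElementT, DestinationT> dynamicDestinations) {
    for (KV<DestinationT, ResultT> entry : element) {
      DestinationT destination = entry.getKey();
      // DynamicDestinations.getDestination() is documented to never return null,
      // so, as in WriteTables, a null table here is treated as an error.
      TableDestination tableDestination =
          Objects.requireNonNull(
              dynamicDestinations.getTable(destination),
              "DynamicDestinations.getTable() must not return null");
      // The schema update for tableDestination would be issued here.
    }
  }
}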

…atchLoad

* Handle dynamic table destination in UpdateSchemaDestination impl

* Add ZERO_LOAD job type for schema update load

* Fix BigQuerySchemaUpdateOptionsIT to actually test the temp tables scenario

Rewrite BigQuerySchemaUpdateOptionsIT.runWriteTestTempTable to test dynamicDestination scenario

@Abacn Abacn marked this pull request as ready for review February 10, 2023 04:38

@github-actions

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@github-actions

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @johnjcasey for label io.


@ahmedabu98 ahmedabu98 left a comment

Thanks a lot for this fix! This unblocks a popular use case: large writes to dynamic destinations. Left a few comments; mainly, I think that seeing ZERO_LOAD in logs will be confusing to people who don't have context around it. Maybe a clearer name would help. Besides that, this PR looks great.

Lists.newArrayList(zeroLoadJobIdPrefixView);
sideInputsForUpdateSchema.addAll(dynamicDestinations.getSideInputs());

PCollection<TableDestination> successfulMultiPartitionWrites =

Contributor:

Would it make sense to have a GBK after writeTempTables? Per your comment on the issue, we'd end up with a PCollection<KV<DestinationT, Iterable<WriteTables.Result>>>. It would also protect writeTempTables against retries if UpdateSchemaDestination fails.
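
A hedged sketch of what that grouping could look like, not how this PR is implemented; the class, method, and step names are made up, and ResultT stands in for WriteTables.Result:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class GroupTempTablesSketch {
  // Collect all temp-table results for a destination into one element, so a
  // downstream schema-update step sees each destination once per pane.
  static <DestinationT, ResultT>
      PCollection<KV<DestinationT, Iterable<ResultT>>> groupByDestination(
          PCollection<KV<DestinationT, ResultT>> tempTableResults) {
    return tempTableResults.apply(
        "GroupTempTableResultsByDestination",
        GroupByKey.<DestinationT, ResultT>create());
  }
}

Note that a GroupByKey also requires a deterministic coder for the DestinationT key.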

Contributor:

I see this would need some changes in WriteRename as well; perhaps that could be an improvement in a separate PR.

Contributor Author:

After leaving that comment, I found that using a GBK may break the use case where the final destination table gets updated more than once, e.g. in streaming file loads. With careful windowing we could of course avoid that, but I have not looked deeply into it at this moment. For now I am leaving the overall structure of the pipeline unchanged.

Contributor:

using a GBK may break the use case where the final destination table gets updated more than once, e.g. in streaming file loads

Ideally, this check would prevent that if it worked as intended:

Contributor Author:

Yeah, agreed, ideally. I am just not confident enough and want to keep the change limited to what is necessary to fix the bug (though the change is already not minor).

@ahmedabu98 ahmedabu98 left a comment

LGTM if tests pass


Abacn commented Feb 10, 2023

Thanks. https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Java17_Commit/5611/ passed, though the status was not updated on the GitHub UI.

@Abacn Abacn merged commit be66a60 into apache:master Feb 10, 2023
@Abacn Abacn deleted the fixupdschmdest branch February 10, 2023 20:23

Successfully merging this pull request may close these issues:

[Bug]: BigQuery BatchLoad incompatible table schema error