Conversation

@Abacn Abacn commented Feb 9, 2023

Fixes #25355

  • Handle dynamic table destinations in UpdateSchemaDestination impl

  • Add ZERO_LOAD job type for schema update load

  • Fix BigQuerySchemaUpdateOptionsIT to actually test the temp tables scenario

Rewrite BigQuerySchemaUpdateOptionsIT.runWriteTestTempTable to test the dynamicDestination scenario


Abacn commented Feb 9, 2023

The current UpdateSchemaDestination implementation in Java does not consider dynamic destinations; it simply takes the first destination it sees. This causes problems in many ways. The integration test that is supposed to cover this class did not work either: it should set .withMaxFileSize() instead of .withMaxBytesPerPartition() to produce multiple files and therefore multiple partitions. Otherwise only a single file was written and the MultiPartitions branch of the pipeline was never exercised.
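
A minimal sketch, under stated assumptions, of the write configuration such a test needs: the input PCollection rows, the updated schema newSchema, the table names, and the class/method names are hypothetical placeholders, and the exact sizes the real integration test uses may differ.

import java.util.EnumSet;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.SchemaUpdateOption;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.PCollection;

class SchemaUpdateWriteSketch {
  static void writeWithDynamicDestinations(PCollection<TableRow> rows, TableSchema newSchema) {
    rows.apply(
        "WriteWithDynamicDestinations",
        BigQueryIO.writeTableRows()
            // Route rows to more than one table so the schema-update step must
            // handle several destinations, not just the first one it sees.
            .to(row -> new TableDestination(
                "my-project:my_dataset.table_" + row.getValue().get("branch"), null))
            .withSchema(newSchema)
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withSchemaUpdateOptions(EnumSet.of(SchemaUpdateOption.ALLOW_FIELD_ADDITION))
            // Per the comment above: a small max file size yields multiple files
            // and hence multiple partitions, exercising the temp-table branch.
            .withMaxFileSize(100)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}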

for (KV<DestinationT, WriteTables.Result> entry : element) {
  destination = entry.getKey();
  if (destination != null) {

Contributor Author:

I neglected the handling of a null destination in the original implementation. I do not see what kind of scenario would produce a null destination (unless a user-provided DynamicDestinations.getDestination returns null, but the documentation says it may not return null). Nevertheless, the upstream WriteTempTables essentially has its processElement call dynamicDestinations.getTable(destination) for all incoming elements:

TableDestination tableDestination = dynamicDestinations.getTable(destination);

and requires it to return a non-null tableDestination (the same call is used in UpdateSchemaDestination). So the behavior here is made consistent with WriteTables.
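
As a hedged illustration of that consistency, a small sketch of the per-element handling; the class and method names are made up, and ResultT stands in for the internal WriteTables.Result value type:

import java.util.Objects;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.values.KV;

class UpdateSchemaDestinationSketch {
  // Resolve every destination seen in the element rather than only the first one.
  static <ElementT, DestinationT, ResultT> void resolveAllDestinations(
      Iterable<KV<DestinationT, ResultT>> element,
      DynamicDestinations<ElementT, DestinationT> dynamicDestinations) {
    for (KV<DestinationT, ResultT> entry : element) {
      DestinationT destination = entry.getKey();
      // DynamicDestinations.getDestination() is documented to never return null,
      // so, as in WriteTables, a null table here is treated as an error.
      TableDestination tableDestination =
          Objects.requireNonNull(
              dynamicDestinations.getTable(destination),
              "DynamicDestinations.getTable() must not return null");
      // The schema update for tableDestination would be issued here.
    }
  }
}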

…atchLoad

* Handle dynamic table destination in UpdateSchemaDestination impl

* Add ZERO_LOAD job type for schema update load

* Fix BigQuerySchemaUpdateOptionsIT to actually test the temp tables scenario

Rewrite BigQuerySchemaUpdateOptionsIT.runWriteTestTempTable to test dynamicDestination scenario

@Abacn Abacn marked this pull request as ready for review February 10, 2023 04:38

@github-actions

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @ahmedabu98 for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@github-actions

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @kennknowles for label java.
R: @johnjcasey for label io.


@ahmedabu98 ahmedabu98 left a comment

Thanks a lot for this fix! This unblocks a popular use case: large writes to dynamic destinations. Left a few comments; mainly, I think that seeing ZERO_LOAD in logs will be confusing to people who don't have context around it. Maybe a clearer name would help. Besides that, this PR looks great.

Lists.newArrayList(zeroLoadJobIdPrefixView);
sideInputsForUpdateSchema.addAll(dynamicDestinations.getSideInputs());

PCollection<TableDestination> successfulMultiPartitionWrites =

Contributor:

Would it make sense to have a GBK after writeTempTables? Per your comment on the issue, we'd end up with a PCollection<KV<DestinationT, Iterable<WriteTables.Result>>>. It would also protect writeTempTables against retries if UpdateSchemaDestination fails.
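
A hedged sketch of what that grouping could look like, not how this PR is implemented; the class, method, and step names are made up, and ResultT stands in for WriteTables.Result:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

class GroupTempTablesSketch {
  // Collect all temp-table results for a destination into one element, so a
  // downstream schema-update step sees each destination once per pane.
  static <DestinationT, ResultT>
      PCollection<KV<DestinationT, Iterable<ResultT>>> groupByDestination(
          PCollection<KV<DestinationT, ResultT>> tempTableResults) {
    return tempTableResults.apply(
        "GroupTempTableResultsByDestination",
        GroupByKey.<DestinationT, ResultT>create());
  }
}

Note that a GroupByKey also requires a deterministic coder for the DestinationT key.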

Contributor:

I see this would need some changes in WriteRename as well; perhaps that could be an improvement in a separate PR.

Contributor Author:

After leaving that comment, I found that using a GBK may break the use case where the final destination table gets updated more than once, e.g. in streaming file loads. With careful windowing we could of course avoid that, but I have not looked deeply into it at this moment. For now I am leaving the overall structure of the pipeline unchanged.

Contributor:

using a GBK may break the use case where the final destination table gets updated more than once, e.g. in streaming file loads

Ideally, this check would prevent that if it worked as intended:

Contributor Author:

Yeah, agreed, ideally. I am just not confident enough and want to keep the change limited to what is necessary to fix the bug (though the change is already not minor).

@ahmedabu98 ahmedabu98 left a comment

LGTM if tests pass


Abacn commented Feb 10, 2023

Thanks. https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Java17_Commit/5611/ passed, though the status was not updated on the GitHub UI.

@Abacn Abacn merged commit be66a60 into apache:master Feb 10, 2023
@Abacn Abacn deleted the fixupdschmdest branch February 10, 2023 20:23

Successfully merging this pull request may close these issues:

[Bug]: BigQuery BatchLoad incompatible table schema error