-
Notifications
You must be signed in to change notification settings - Fork 4.5k
[BEAM-11408] Integrate BigQuery sink streaming inserts with GroupIntoBatches #13496
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
468b4b7 to
d187951
Compare
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
d187951 to
3b0992c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this limit. What would be a proper value? Should we make it configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be hones, I am not sure what's a good duration either. I think this is acceptable for now, until we find out more. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sounds reasonable to proceed with this for now.
...e-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/StreamingWriteTables.java
Outdated
Show resolved
Hide resolved
...google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryOptions.java
Outdated
Show resolved
Hide resolved
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are relying on the fact that the GroupIntoBatches produces stable output. Really we should tag this with RequiresStableInput. Can you find out if this is safe to do in Dataflow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a transform override in Dataflow to add a preceding Reshuffle to DoFns marked with RequiresStableInput.
Line 73 in 74ec609
| .apply("Materialize input", Reshuffle.viaRandomKey()) |
The override is disabled though. So I guess currently Dataflow does nothing for this tag.
Line 601 in df74d74
| /* TODO[Beam-4684]: Support @RequiresStableInput on Dataflow in a more intelligent way |
Also my understanding is that dding a Reshuffle before GroupIntoBatches will introduce an extra shuffle as Reshuffle is essentially a GBK + value expansion in Dataflow.
...-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java
Outdated
Show resolved
Hide resolved
f8790e3 to
0ab05d9
Compare
|
@reuvenlax I also have changes for FILE_LOADS ready in my local branch. If it is ok for you, I can merge those into this PR. Otherwise I will send a follow-up PR. |
|
R: @pabloem |
Hey Pablo, let me know if it is ok for you to include the changes in FILE_LOADS in this PR as well. If so, I will push a new commit. |
|
Run Java PostCommit |
|
Run Java PreCommit |
|
There are a couple checkstyle warnings on Precommit: These mean that the variables should be static to have CAPITALIZED_NAMES, or should be named with camelCase if they are not static. Can you fix that? Also, it seems there are some merge conflicts. Can you fix that as well? The last thing to figure out is how to address Reuven's comment regarding stable input |
0ab05d9 to
c033215
Compare
|
Run Java PostCommit |
|
Run Java_Examples_Dataflow_Java11 PreCommit |
Thanks for pointing out! Pushed a commit to fix those.
Done.
Based on my understanding Dataflow currently doesn't do anything for @RequiresStableInputs so it may be considered as safe for now but if we add naive support like a Reshuffle it would be adding duplicated shuffles. How about adding a TODO here so we don't forget? |
|
There seems to be an issue with a try/catch: https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Commit/12361/console |
|
Run Java PostCommit |
|
Run Java PreCommit |
|
Postcommit from previous commit: https://ci-beam.apache.org/job/beam_PostCommit_Java_PR/558/ |
|
Thanks @nehsyc ! |
…sink streaming inserts with GroupIntoBatches * Integrate BQ streaming inserts with GroupIntoBatches * Moved autosharding option from BigQueryOption to BigQueryIOBuilder; addressed comments. * fix checkstyle error * Revert the logic that was dropped during merge * Add comments for RequiresStableInput
Use
GroupIntoBatches.WithShardedKeyAPI to group and batch write before streaming to BigQuery service. Currently batching is done best-effort on bundle finalization.This PR
BigQueryOptionsto toggle between the existing and new implementation;BatchedStreamingWriteand provides an option to choose the implementation.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username).[BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replaceBEAM-XXXwith the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
Post-Commit Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.