Skip to content

Conversation

@nehsyc
Copy link
Contributor

@nehsyc nehsyc commented Dec 7, 2020

Use GroupIntoBatches.WithShardedKey API to group and batch write before streaming to BigQuery service. Currently batching is done best-effort on bundle finalization.

This PR

  • adds an option to BigQueryOptions to toggle between the existing and new implementation;
  • extracts the shared code between the old and new implementation to a class BatchedStreamingWrite and provides an option to choose the implementation.

Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Choose reviewer(s) and mention them in a comment (R: @username).
  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

Post-Commit Tests Status (on master branch)

Lang SDK Dataflow Flink Samza Spark Twister2
Go Build Status --- Build Status --- Build Status ---
Java Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status
Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
--- Build Status ---
XLang Build Status Build Status Build Status --- Build Status ---

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website Whitespace Typescript
Non-portable Build Status Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status Build Status
Portable --- Build Status --- --- --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests

See CI.md for more information about GitHub Actions CI.

@nehsyc
Copy link
Contributor Author

nehsyc commented Dec 7, 2020

R: @reuvenlax @chamikaramj

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this limit. What would be a proper value? Should we make it configurable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be hones, I am not sure what's a good duration either. I think this is acceptable for now, until we find out more. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah sounds reasonable to proceed with this for now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are relying on the fact that the GroupIntoBatches produces stable output. Really we should tag this with RequiresStableInput. Can you find out if this is safe to do in Dataflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a transform override in Dataflow to add a preceding Reshuffle to DoFns marked with RequiresStableInput.

The override is disabled though. So I guess currently Dataflow does nothing for this tag.

/* TODO[Beam-4684]: Support @RequiresStableInput on Dataflow in a more intelligent way

Also my understanding is that dding a Reshuffle before GroupIntoBatches will introduce an extra shuffle as Reshuffle is essentially a GBK + value expansion in Dataflow.

@nehsyc nehsyc force-pushed the bq-integration-java branch 2 times, most recently from f8790e3 to 0ab05d9 Compare January 9, 2021 00:52
@nehsyc nehsyc requested a review from reuvenlax January 11, 2021 17:37
@nehsyc
Copy link
Contributor Author

nehsyc commented Jan 15, 2021

@reuvenlax I also have changes for FILE_LOADS ready in my local branch. If it is ok for you, I can merge those into this PR. Otherwise I will send a follow-up PR.

@nehsyc
Copy link
Contributor Author

nehsyc commented Jan 26, 2021

R: @pabloem

@nehsyc
Copy link
Contributor Author

nehsyc commented Jan 26, 2021

R: @pabloem

Hey Pablo, let me know if it is ok for you to include the changes in FILE_LOADS in this PR as well. If so, I will push a new commit.

@pabloem
Copy link
Member

pabloem commented Jan 27, 2021

Run Java PostCommit

@pabloem
Copy link
Member

pabloem commented Jan 27, 2021

Run Java PreCommit

@pabloem
Copy link
Member

pabloem commented Jan 27, 2021

There are a couple checkstyle warnings on Precommit:

[ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java:85:24: Name 'LOG' must match pattern '^[a-z][a-zA-Z0-9]*$'. [MemberName] |  
  | [ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java:279:28: Name 'BATCH_MAX_BUFFERING_DURATION' must match pattern '^[a-z][a-zA-Z0-9]*$'. [MemberName]

These mean that the variables should be static to have CAPITALIZED_NAMES, or should be named with camelCase if they are not static. Can you fix that?

Also, it seems there are some merge conflicts. Can you fix that as well?

The last thing to figure out is how to address Reuven's comment regarding stable input

@nehsyc nehsyc force-pushed the bq-integration-java branch from 0ab05d9 to c033215 Compare January 27, 2021 22:31
@pabloem
Copy link
Member

pabloem commented Jan 27, 2021

Run Java PostCommit

@pabloem
Copy link
Member

pabloem commented Jan 27, 2021

Run Java_Examples_Dataflow_Java11 PreCommit

@nehsyc
Copy link
Contributor Author

nehsyc commented Jan 27, 2021

There are a couple checkstyle warnings on Precommit:

[ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java:85:24: Name 'LOG' must match pattern '^[a-z][a-zA-Z0-9]*$'. [MemberName] |  
  | [ant:checkstyle] [ERROR] /home/jenkins/jenkins-slave/workspace/beam_PreCommit_Java_Phrase/src/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchedStreamingWrite.java:279:28: Name 'BATCH_MAX_BUFFERING_DURATION' must match pattern '^[a-z][a-zA-Z0-9]*$'. [MemberName]

These mean that the variables should be static to have CAPITALIZED_NAMES, or should be named with camelCase if they are not static. Can you fix that?

Thanks for pointing out! Pushed a commit to fix those.

Also, it seems there are some merge conflicts. Can you fix that as well?

Done.

The last thing to figure out is how to address Reuven's comment regarding stable input

Based on my understanding Dataflow currently doesn't do anything for @RequiresStableInputs so it may be considered as safe for now but if we add naive support like a Reshuffle it would be adding duplicated shuffles. How about adding a TODO here so we don't forget?

@pabloem
Copy link
Member

pabloem commented Jan 27, 2021

There seems to be an issue with a try/catch: https://ci-beam.apache.org/job/beam_PreCommit_Java_Examples_Dataflow_Commit/12361/console

@pabloem
Copy link
Member

pabloem commented Jan 28, 2021

Run Java PostCommit

@pabloem
Copy link
Member

pabloem commented Jan 28, 2021

Run Java PreCommit

@pabloem
Copy link
Member

pabloem commented Jan 28, 2021

Postcommit from previous commit: https://ci-beam.apache.org/job/beam_PostCommit_Java_PR/558/

@pabloem pabloem merged commit f0a3f6d into apache:master Jan 28, 2021
@pabloem
Copy link
Member

pabloem commented Jan 28, 2021

Thanks @nehsyc !

nehsyc added a commit to nehsyc/beam that referenced this pull request Jan 28, 2021
…sink streaming inserts with GroupIntoBatches

* Integrate BQ streaming inserts with GroupIntoBatches

* Moved autosharding option from BigQueryOption to BigQueryIOBuilder; addressed comments.

* fix checkstyle error

* Revert the logic that was dropped during merge

* Add comments for RequiresStableInput
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants