[BEAM-10047] Merge the stages 'Gather and Sort' and 'Create Batches' #11570

nielm · 2020-04-29T15:19:49Z

There is minimal benefit in separating these 2 stages, and significant
benefit in merging them: Gather and Sort encodes incoming
MutationGroups into a List<byte[]> which can contain up to 1GB.
This list is then output (copied) to CreateBatches, where it is decoded
back into MutationGroups.

Removing this encode/decode should save up to 2GB of RAM.
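
As a rough illustration of the idea (not the actual SpannerIO code), the sketch below sorts and batches gathered groups in one step, working on the objects directly instead of passing an encoded List<byte[]> between two stages. The class `GatherSortBatchSketch`, the stand-in type `Group`, and the single byte-size limit are all hypothetical simplifications chosen for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class GatherSortBatchSketch {

  /** Hypothetical stand-in for a MutationGroup: a sort key and a size estimate. */
  static final class Group {
    final String key;
    final long sizeBytes;

    Group(String key, long sizeBytes) {
      this.key = key;
      this.sizeBytes = sizeBytes;
    }
  }

  /**
   * Sorts the gathered groups by key and cuts them into batches of at most
   * maxBatchBytes each, with no intermediate byte[] encoding. A single group
   * larger than the limit ends up in a batch of its own.
   */
  static List<List<Group>> sortAndBatch(List<Group> gathered, long maxBatchBytes) {
    List<Group> sorted = new ArrayList<>(gathered);
    sorted.sort(Comparator.comparing((Group g) -> g.key));

    List<List<Group>> batches = new ArrayList<>();
    List<Group> current = new ArrayList<>();
    long currentBytes = 0;
    for (Group g : sorted) {
      // Close the current batch if adding this group would exceed the limit.
      if (!current.isEmpty() && currentBytes + g.sizeBytes > maxBatchBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(g);
      currentBytes += g.sizeBytes;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Group> gathered = new ArrayList<>();
    gathered.add(new Group("c", 40));
    gathered.add(new Group("a", 60));
    gathered.add(new Group("b", 50));
    // Print the keys of each batch on its own line.
    for (List<Group> batch : sortAndBatch(gathered, 100)) {
      StringBuilder keys = new StringBuilder();
      for (Group g : batch) {
        keys.append(g.key);
      }
      System.out.println(keys);
    }
  }
}
```

Because sorting and batching share the same in-memory list, nothing is serialized and then deserialized between the two steps, which is where the RAM saving described above would come from.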

Note: this PR depends on PR #11528, PR #11529 and PR #11532.

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

allenpradeep · 2020-05-01T21:42:59Z

This is great, Niel. With these changes, there are 3 modes of using SpannerIO write:
a) Conventional (as it was until now): use a grouping factor, so data is grouped, sorted, batched and written as per the parameters.
b) Batching without grouping: set the grouping factor to 1 with a larger batch size in bytes or cells. Data is then batched without being sorted.
c) No batching: set any of max rows, max mutations or batch bytes to 0 or 1.
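
To make the three modes concrete, here is a small sketch that maps parameter values to the mode they would select, following the rules as stated in the list above. The class `WriteModeSketch`, the enum, and the `classify` helper are hypothetical and are not part of SpannerIO; the parameter names loosely mirror the grouping factor, batch bytes, max mutations and max rows settings, and the numbers in `main` are example values only.

```java
final class WriteModeSketch {

  /** The three modes of operation described in the comment above. */
  enum Mode {
    GROUP_SORT_AND_BATCH,
    BATCH_WITHOUT_GROUPING,
    NO_BATCHING
  }

  /**
   * Hypothetical helper: returns the mode implied by the given settings,
   * assuming the rules stated above (any batch limit of 0 or 1 disables
   * batching; a grouping factor of 1 batches without sorting).
   */
  static Mode classify(
      int groupingFactor, long batchSizeBytes, long maxNumMutations, long maxNumRows) {
    if (batchSizeBytes <= 1 || maxNumMutations <= 1 || maxNumRows <= 1) {
      return Mode.NO_BATCHING;
    }
    if (groupingFactor <= 1) {
      return Mode.BATCH_WITHOUT_GROUPING;
    }
    return Mode.GROUP_SORT_AND_BATCH;
  }

  public static void main(String[] args) {
    // Example values only, chosen to hit each mode once.
    System.out.println(classify(1000, 1024 * 1024, 5000, 500));
    System.out.println(classify(1, 10 * 1024 * 1024, 5000, 500));
    System.out.println(classify(1000, 0, 5000, 500));
  }
}
```

Note that in this sketch the batching check comes first, so a batch limit of 0 or 1 wins regardless of the grouping factor.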

Questions:

What mode should our import pipeline use? Should it use option b as data in AVRO seems already sorted?
Where should we document these modes of operation so that some customer can use these?

allenpradeep · 2020-05-06T17:47:09Z

I'm good with these changes except the questions I had regarding the usages.
LGTM

...ava/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/SpannerIO.java

allenpradeep · 2020-05-07T03:12:06Z

Hi Niel,
I see a bunch of unit tests failing on this commit.
I noticed this while working on a patch on top of it.

nielm · 2020-05-18T23:33:34Z

@allenpradeep

What mode should our import pipeline use? Should it use option b as data in AVRO seems already sorted?

We can discuss this outside the scope of this PR.

Where should we document these modes of operation so that some customer can use these?

I have added a section to the javadoc explaining these 3 modes of operation, and their pros and cons.

TheNeuralBit · 2020-05-18T23:58:57Z

Retest this please

nielm · 2020-05-19T12:05:41Z

Retest this please

TheNeuralBit · 2020-05-19T15:31:53Z

Retest this please

TheNeuralBit · 2020-05-19T15:38:04Z

Retest this please

nielm · 2020-06-10T08:46:25Z

Retest this please

TheNeuralBit · 2020-06-12T19:18:11Z

Retest this please

udim · 2020-06-23T18:01:16Z

Is this ready to merge?

tvalentyn · 2020-06-24T02:10:42Z

Run Java PreCommit

tvalentyn · 2020-06-24T02:11:58Z

[CheckStyle] Attaching ResultAction with ID 'checkstyle' to run 'beam_PreCommit_Java_Commit #11858'.
Setting status of a74866ba56d92d9476006b7e40a0e0ff916748ca to FAILURE with url https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/11858/ and message: 'Build finished. '
Using context: Java ("Run Java PreCommit")
Finished: ABORTED

Can't tell if tests passed or not, rerunning.

allenpradeep · 2020-06-24T23:07:04Z

Can we merge this PR? I want to send out a PR that counts the bytes written to Spanner, and it depends on this one.

nielm · 2020-06-25T09:45:23Z

Retest this please

chamikaramj · 2020-06-26T19:39:06Z

Run Java PostCommit

chamikaramj · 2020-06-26T19:40:01Z

Thanks. We can merge if post-commit tests pass.

probot-autolabeler bot added gcp io java labels Apr 29, 2020

nielm force-pushed the coalesceStages branch 3 times, most recently from f6a09a2 to 87d22b5 Compare May 1, 2020 12:01

allenpradeep approved these changes May 6, 2020

nielm force-pushed the coalesceStages branch from 87d22b5 to 98c0aaf Compare May 18, 2020 23:00

nielm force-pushed the coalesceStages branch 2 times, most recently from ac119f0 to 8f94438 Compare May 18, 2020 23:43

nielm force-pushed the coalesceStages branch from 8f94438 to fcf4a1c Compare May 19, 2020 12:05

TheNeuralBit changed the title ~~[BEAM-9822] Merge the stages 'Gather and Sort' and 'Create Batches'~~ [BEAM-10047] Merge the stages 'Gather and Sort' and 'Create Batches' May 20, 2020

nielm added 2 commits June 10, 2020 10:41

Add additional documentation on Batching and Grouping

a74866b

nielm force-pushed the coalesceStages branch from fcf4a1c to a74866b Compare June 10, 2020 08:46

udim merged commit d7450bb into apache:master Jun 26, 2020
