Skip to content

Conversation

@nielm
Copy link
Contributor

@nielm nielm commented Apr 29, 2020

There is minimal benefit in separating these 2 stages, and significant
benefity in merging them: Gather and Sort encodes incoming
MutationGroups into a List<byte[]> which would contain up to 1GB.
This is then output (copied) to the CreateBatches where it is decoded
back into MutationGroups.

Removing this encode/decode should save up to 2GB of RAM.

Note, this PR is dependent on PR #11528, PR #11532 and PR #11529

Post-Commit Tests Status (on master branch)

Lang SDK Apex Dataflow Flink Gearpump Samza Spark
Go Build Status --- --- Build Status --- --- Build Status
Java Build Status Build Status Build Status
Build Status
Build Status
Build Status
Build Status
Build Status
Build Status Build Status Build Status
Build Status
Build Status
Python Build Status
Build Status
Build Status
Build Status
--- Build Status
Build Status
Build Status
Build Status
Build Status
--- --- Build Status
XLang --- --- --- Build Status --- --- Build Status

Pre-Commit Tests Status (on master branch)

--- Java Python Go Website
Non-portable Build Status Build Status
Build Status
Build Status Build Status
Portable --- Build Status --- ---

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@nielm nielm force-pushed the coalesceStages branch 3 times, most recently from f6a09a2 to 87d22b5 Compare May 1, 2020 12:01
@allenpradeep
Copy link
Contributor

This is great niel. With these changes, there are 3 modes of using SpannerIO write.
a) Use the conventional way(as it was till now) with a grouping factor where data is grouped, sorted, batched and written as per parameters
b) Batching without grouping - Set grouping factor as 1 with a larger batched bytes or cells. This will just ensure data is just batched without sort.
c) No Batching - Set any of the max rows or max mutations or batch bytes to 0 or 1.

Questions:

  1. What mode should our import pipeline use? Should it use option b as data in AVRO seems already sorted?
  2. Where should we document these modes of operation so that some customer can use these?

@allenpradeep
Copy link
Contributor

I'm good with these changes except the questions I had regarding the usages.
LGTM

@allenpradeep
Copy link
Contributor

Hi Niel,
I see a bunch of unit tests failing on this commit.
I am working on a patch on top of this and i noticed this.

@nielm nielm force-pushed the coalesceStages branch from 87d22b5 to 98c0aaf Compare May 18, 2020 23:00
@nielm
Copy link
Contributor Author

nielm commented May 18, 2020

@allenpradeep

  1. What mode should our import pipeline use? Should it use option b as data in AVRO seems already sorted?

We can discuss this outside the scope of this PR.

  1. Where should we document these modes of operation so that some customer can use these?

I have added a section to the javadoc explaining these 3 modes of operation, and their pros and cons.

@nielm nielm force-pushed the coalesceStages branch 2 times, most recently from ac119f0 to 8f94438 Compare May 18, 2020 23:43
@TheNeuralBit
Copy link
Member

Retest this please

@nielm nielm force-pushed the coalesceStages branch from 8f94438 to fcf4a1c Compare May 19, 2020 12:05
@nielm
Copy link
Contributor Author

nielm commented May 19, 2020

Retest this please

2 similar comments
@TheNeuralBit
Copy link
Member

Retest this please

@TheNeuralBit
Copy link
Member

Retest this please

@TheNeuralBit TheNeuralBit changed the title [BEAM-9822] Merge the stages 'Gather and Sort' and 'Create Batches' [BEAM-10047] Merge the stages 'Gather and Sort' and 'Create Batches' May 20, 2020
nielm added 2 commits June 10, 2020 10:41
There is minimal benefit in separating these 2 stages, and significant
benefity in merging them: Gather and Sort encodes incoming
MutationGroups into a List<byte[]> which would contain up to 1GB.
This is then output (copied) to the CreateBatches where it is decoded
back into MutationGroups.

Removing this encode/decode should save up to 2GB of RAM.
@nielm
Copy link
Contributor Author

nielm commented Jun 10, 2020

Retest this please

1 similar comment
@TheNeuralBit
Copy link
Member

Retest this please

@udim
Copy link
Member

udim commented Jun 23, 2020

Is this ready to merge?

@tvalentyn
Copy link
Contributor

Run Java PreCommit

@tvalentyn
Copy link
Contributor

[CheckStyle] Attaching ResultAction with ID 'checkstyle' to run 'beam_PreCommit_Java_Commit #11858'.
Setting status of a74866ba56d92d9476006b7e40a0e0ff916748ca to FAILURE with url https://ci-beam.apache.org/job/beam_PreCommit_Java_Commit/11858/ and message: 'Build finished. '
Using context: Java ("Run Java PreCommit")
Finished: ABORTED

Can't tell if tests passed or not, rerunning.

@allenpradeep
Copy link
Contributor

allenpradeep commented Jun 24, 2020

Can we merge this PR? I would want to send out a PR to count bytes written to spanner and that would be dependent on this.

@nielm
Copy link
Contributor Author

nielm commented Jun 25, 2020

Retest this please

@chamikaramj
Copy link
Contributor

Run Java PostCommit

@chamikaramj
Copy link
Contributor

Thanks. We can merge if post-commit tests pass.

@udim udim merged commit d7450bb into apache:master Jun 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants