[BEAM-115] Remove GroupByKey expansion, invoke it on a per-runner basis. #77
Conversation
R: @amitsela there's some incidental import reordering in the Spark code. I used Eclipse, which sorted the imports in a way that actually broke the Spark runner's checkstyle. I didn't fix it up, only because checkstyle let me get by with things the way they are... :-) If anyone would like me to undo the automatic whitespace smooshing my IDE did, I am happy to. And, of course, everyone should feel free to comment on the overall change. There should be no observable behavioral change. I did rely somewhat on unit tests to catch anything egregious, so if there is a lack of coverage I could have missed an issue.
```diff
 EVALUATORS.put(ParDo.Bound.class, parDo());
 EVALUATORS.put(ParDo.BoundMulti.class, multiDo());
-EVALUATORS.put(GroupByKey.GroupByKeyOnly.class, gbk());
+EVALUATORS.put(GroupByKeyOnly.class, gbk());
```
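The registry pattern in the diff above can be sketched in plain Java. This is a minimal stand-alone illustration, not the Beam API: the `ParDoBound` and `GroupByKeyOnly` classes here are hypothetical stand-ins, and the string-returning evaluators merely mark which entry was dispatched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins for the Beam transform classes; illustration only.
class ParDoBound {}
class GroupByKeyOnly {}

public class EvaluatorRegistry {
    // Maps a transform's class to the evaluator that handles it, mirroring
    // the EVALUATORS map in the diff above. After the change, the key is the
    // top-level GroupByKeyOnly class rather than the nested
    // GroupByKey.GroupByKeyOnly.
    private static final Map<Class<?>, Function<Object, String>> EVALUATORS = new HashMap<>();

    static {
        EVALUATORS.put(ParDoBound.class, t -> "evaluated ParDo");
        EVALUATORS.put(GroupByKeyOnly.class, t -> "evaluated GBK");
    }

    static String evaluate(Object transform) {
        Function<Object, String> evaluator = EVALUATORS.get(transform.getClass());
        if (evaluator == null) {
            throw new IllegalArgumentException("No evaluator for " + transform.getClass());
        }
        return evaluator.apply(transform);
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new GroupByKeyOnly())); // prints "evaluated GBK"
    }
}
```

The point of the one-line change is only which class object is used as the lookup key; dispatch by `transform.getClass()` is unchanged.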
Given that this also removes the default GroupByKey expansion, will this work without adding a runner-specific expansion?
Good point. I didn't catch that this runner does use the expansion. I will move some bits around (more things have to be public). I will still use util/ as a temporary holding pen for the pieces for now, I think.
Done.
```java
        }
    }

    private static class GroupByKeyOnlyTranslatorBatch<K, V>
            implements FlinkBatchPipelineTranslator.BatchTransformTranslator<GroupByKey.GroupByKeyOnly<K, V>> {
```
I removed this, but I admit I'm not sure what the intent is. There is a translator for the whole GroupByKey, so this should have been dead code. On the other hand, the translator translates GroupByKey to GroupByKeyOnly, so perhaps it would be better to use the expanded form, like Spark and the DirectPipelineRunner. Let me know.
GroupByKey expands not only into GroupByKeyOnly but also does the Windowing and timestamp assignment. In early Dataflow versions, this used to be different. When the changes came, we introduced an additional translator to skip the Window assignment. I would leave it as it is for now and do an immediate follow-up pull request where we get rid of this artifact. IMHO the GroupByKeyOnly translator should stay and GroupByKey should be removed.
Ah, had a second look. GroupByKey has been removed. Should be good to merge as it is, then.
Please take another look. When merged, the commit title can be changed to reflect the new structure: the expansion is just moved to the side.
Tests don't pass.
```java
 * <p>This implementation of {@link GroupByKey} proceeds by reifying windows and timestamps (making
 * them part of the element rather than metadata), performing a {@link GroupByKeyOnly} primitive,
 * then using a {@link GroupAlsoByWindow} transform to further group the resulting elements by
 * window.
```
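The composite described in that javadoc can be sketched without the Beam API. This is a hypothetical, self-contained model: the `Reified` record and the string-typed keys and windows are stand-ins for Beam's `WindowedValue` machinery, used only to show the shape of the three steps (reify metadata into the element, group by key only, then group again by window).

```java
import java.util.*;
import java.util.stream.*;

public class GbkExpansionSketch {
    // One input element. Timestamp and window are normally metadata, but the
    // reify step makes them part of the element value itself.
    record Reified(String key, String value, long timestamp, String window) {}

    // GroupByKeyOnly: group solely by key, ignoring windows entirely.
    static Map<String, List<Reified>> groupByKeyOnly(List<Reified> input) {
        return input.stream().collect(Collectors.groupingBy(Reified::key));
    }

    // GroupAlsoByWindow: within each key's iterable, regroup by window.
    // As noted in the review, this assumes the whole iterable per key is
    // available at once, which only holds for bounded input.
    static Map<String, Map<String, List<String>>> groupAlsoByWindow(
            Map<String, List<Reified>> byKey) {
        Map<String, Map<String, List<String>>> out = new HashMap<>();
        byKey.forEach((key, elems) -> out.put(key,
            elems.stream().collect(Collectors.groupingBy(
                Reified::window,
                Collectors.mapping(Reified::value, Collectors.toList())))));
        return out;
    }

    public static void main(String[] args) {
        List<Reified> input = List.of(
            new Reified("a", "x", 1L, "w1"),
            new Reified("a", "y", 2L, "w2"),
            new Reified("b", "z", 3L, "w1"));
        Map<String, Map<String, List<String>>> grouped =
            groupAlsoByWindow(groupByKeyOnly(input));
        System.out.println(grouped.get("a").get("w1")); // prints [x]
    }
}
```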
This notably only functions as a composite if the input PCollection is bounded, due to the choice of implementation for GroupAlsoByWindows. Additionally, it makes assumptions about the form in which it will receive per-key input at the point of GroupAlsoByWindow (namely the entire Iterable&lt;T&gt; for each key), so it is not a general implementation.
Noted in javadoc.
Fixed up the tests.
@amitsela any comment on expanding GBK in the Spark runner? It should leave the behavior exactly as it was before. This is the intended method of runner-specific replacements until the new pipeline transformation API is ready.
@davorbonaci Will pass on this PR.
```java
 * <li>{@code SortValuesByTimestamp ParDo(SortValuesByTimestamp)}: The values in the iterables
 *     output by {@link GroupByKeyOnly} are sorted by timestamp.</li>
 * <li>{@code GroupAlsoByWindow}: This primitive processes the sorted values. Today it is
 *     implemented as a {@link ParDo} that calls reserved internal methods.</li>
```
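The SortValuesByTimestamp step that this javadoc names can be sketched in isolation. This is a self-contained illustration, not the Beam implementation: the `TimestampedValue` record is a hypothetical stand-in, and the sketch only shows that each key's iterable from GroupByKeyOnly is put into timestamp order before GroupAlsoByWindow consumes it.

```java
import java.util.*;
import java.util.stream.*;

public class SortValuesByTimestampSketch {
    // Stand-in for a value carrying its reified timestamp.
    record TimestampedValue(String value, long timestamp) {}

    // Orders one key's iterable of values by timestamp, ascending, so a
    // downstream GroupAlsoByWindow step can consume them in event-time order.
    static List<TimestampedValue> sortByTimestamp(List<TimestampedValue> values) {
        return values.stream()
            .sorted(Comparator.comparingLong(TimestampedValue::timestamp))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<TimestampedValue> unsorted = List.of(
            new TimestampedValue("late", 30L),
            new TimestampedValue("early", 10L),
            new TimestampedValue("mid", 20L));
        for (TimestampedValue tv : sortByTimestamp(unsorted)) {
            System.out.println(tv.value()); // prints early, mid, late
        }
    }
}
```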
either all link or all code?
LGTM
As per the Runner API design, this makes GroupByKey very explicitly a primitive, and moves the subsidiary primitives to top-level classes in the util/ (a.k.a. miscellaneous) directory, eventually to move to some appropriate final location for "runner utilities".