[BEAM-115] Remove GroupByKey expansion, invoke it on a per-runner basis. #77
Conversation
R: @amitsela there's some incidental import reordering in the Spark code. I used Eclipse, which sorted the imports in a way that actually broke the Spark runner's checkstyle. I didn't fix it up, only because checkstyle let me get by with things the way they are... :-) If anyone would like me to undo the automatic whitespace smooshing my IDE did, I am happy to. And, of course, everyone should feel free to comment on the overall change. There should be no observable behavioral change. I did rely somewhat on unit tests to catch anything egregious, so if there is a lack of coverage I could have missed an issue.
```diff
 EVALUATORS.put(ParDo.Bound.class, parDo());
 EVALUATORS.put(ParDo.BoundMulti.class, multiDo());
-EVALUATORS.put(GroupByKey.GroupByKeyOnly.class, gbk());
+EVALUATORS.put(GroupByKeyOnly.class, gbk());
```
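The registry pattern in the diff above can be sketched in plain Java. This is a minimal stand-alone illustration, not the Beam API: the `ParDoBound` and `GroupByKeyOnly` classes here are hypothetical stand-ins, and the string-returning evaluators merely mark which entry was dispatched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins for the Beam transform classes; illustration only.
class ParDoBound {}
class GroupByKeyOnly {}

public class EvaluatorRegistry {
    // Maps a transform's class to the evaluator that handles it, mirroring
    // the EVALUATORS map in the diff above. After the change, the key is the
    // top-level GroupByKeyOnly class rather than the nested
    // GroupByKey.GroupByKeyOnly.
    private static final Map<Class<?>, Function<Object, String>> EVALUATORS = new HashMap<>();

    static {
        EVALUATORS.put(ParDoBound.class, t -> "evaluated ParDo");
        EVALUATORS.put(GroupByKeyOnly.class, t -> "evaluated GBK");
    }

    static String evaluate(Object transform) {
        Function<Object, String> evaluator = EVALUATORS.get(transform.getClass());
        if (evaluator == null) {
            throw new IllegalArgumentException("No evaluator for " + transform.getClass());
        }
        return evaluator.apply(transform);
    }

    public static void main(String[] args) {
        System.out.println(evaluate(new GroupByKeyOnly())); // prints "evaluated GBK"
    }
}
```

The point of the one-line change is only which class object is used as the lookup key; dispatch by `transform.getClass()` is unchanged.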
Given that this also removes the default GroupByKey expansion, will this work without adding a runner-specific expansion?
Good point. I didn't catch that this runner does use the expansion. I will move some bits around (more things have to be public). I will still use util/ as a temporary holding pen for the pieces for now, I think.
Done.
```java
        }
    }

    private static class GroupByKeyOnlyTranslatorBatch<K, V>
            implements FlinkBatchPipelineTranslator.BatchTransformTranslator<GroupByKey.GroupByKeyOnly<K, V>> {
```
I removed this, but I admit I'm not sure what the intent is. There is a translator for the whole GroupByKey, so this should have been dead code. On the other hand, the translator translates GroupByKey to GroupByKeyOnly, so perhaps it would be better to use the expanded form, like Spark and the DirectPipelineRunner. Let me know.
GroupByKey expands not only into GroupByKeyOnly but also does the Windowing and timestamp assignment. In early Dataflow versions, this used to be different. When the changes came, we introduced an additional translator to skip the Window assignment. I would leave it as it is for now and do an immediate follow-up pull request where we get rid of this artifact. IMHO the GroupByKeyOnly translator should stay and GroupByKey should be removed.
Ah, had a second look. GroupByKey has been removed. Should be good to merge as it is, then.
Please take another look. When merged, the commit title can be changed to reflect the new structure: the expansion is just moved to the side.
Tests don't pass.
```java
 * <p>This implementation of {@link GroupByKey} proceeds by reifying windows and timestamps (making
 * them part of the element rather than metadata), performing a {@link GroupByKeyOnly} primitive,
 * then using a {@link GroupAlsoByWindow} transform to further group the resulting elements by
 * window.
```
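The composite described in that javadoc can be sketched without the Beam API. This is a hypothetical, self-contained model: the `Reified` record and the string-typed keys and windows are stand-ins for Beam's `WindowedValue` machinery, used only to show the shape of the three steps (reify metadata into the element, group by key only, then group again by window).

```java
import java.util.*;
import java.util.stream.*;

public class GbkExpansionSketch {
    // One input element. Timestamp and window are normally metadata, but the
    // reify step makes them part of the element value itself.
    record Reified(String key, String value, long timestamp, String window) {}

    // GroupByKeyOnly: group solely by key, ignoring windows entirely.
    static Map<String, List<Reified>> groupByKeyOnly(List<Reified> input) {
        return input.stream().collect(Collectors.groupingBy(Reified::key));
    }

    // GroupAlsoByWindow: within each key's iterable, regroup by window.
    // As noted in the review, this assumes the whole iterable per key is
    // available at once, which only holds for bounded input.
    static Map<String, Map<String, List<String>>> groupAlsoByWindow(
            Map<String, List<Reified>> byKey) {
        Map<String, Map<String, List<String>>> out = new HashMap<>();
        byKey.forEach((key, elems) -> out.put(key,
            elems.stream().collect(Collectors.groupingBy(
                Reified::window,
                Collectors.mapping(Reified::value, Collectors.toList())))));
        return out;
    }

    public static void main(String[] args) {
        List<Reified> input = List.of(
            new Reified("a", "x", 1L, "w1"),
            new Reified("a", "y", 2L, "w2"),
            new Reified("b", "z", 3L, "w1"));
        Map<String, Map<String, List<String>>> grouped =
            groupAlsoByWindow(groupByKeyOnly(input));
        System.out.println(grouped.get("a").get("w1")); // prints [x]
    }
}
```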
This notably only functions as a composite if the input PCollection is bounded, due to the choice of implementation for GroupAlsoByWindows. Additionally, it makes assumptions about the form in which it will receive per-key input at the point of GroupAlsoByWindow (namely the entire Iterable&lt;T&gt; for each key), so it is not a general implementation.
Noted in javadoc.
Fixed up the tests.
@amitsela any comment on expanding GBK in the Spark runner? It should leave the behavior exactly as it was before. This is the intended method of runner-specific replacements until the new pipeline transformation API is ready.
@davorbonaci Will pass on this PR.
```java
 * <li>{@code SortValuesByTimestamp ParDo(SortValuesByTimestamp)}: The values in the iterables
 *     output by {@link GroupByKeyOnly} are sorted by timestamp.</li>
 * <li>{@code GroupAlsoByWindow}: This primitive processes the sorted values. Today it is
 *     implemented as a {@link ParDo} that calls reserved internal methods.</li>
```
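The SortValuesByTimestamp step that this javadoc names can be sketched in isolation. This is a self-contained illustration, not the Beam implementation: the `TimestampedValue` record is a hypothetical stand-in, and the sketch only shows that each key's iterable from GroupByKeyOnly is put into timestamp order before GroupAlsoByWindow consumes it.

```java
import java.util.*;
import java.util.stream.*;

public class SortValuesByTimestampSketch {
    // Stand-in for a value carrying its reified timestamp.
    record TimestampedValue(String value, long timestamp) {}

    // Orders one key's iterable of values by timestamp, ascending, so a
    // downstream GroupAlsoByWindow step can consume them in event-time order.
    static List<TimestampedValue> sortByTimestamp(List<TimestampedValue> values) {
        return values.stream()
            .sorted(Comparator.comparingLong(TimestampedValue::timestamp))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<TimestampedValue> unsorted = List.of(
            new TimestampedValue("late", 30L),
            new TimestampedValue("early", 10L),
            new TimestampedValue("mid", 20L));
        for (TimestampedValue tv : sortByTimestamp(unsorted)) {
            System.out.println(tv.value()); // prints early, mid, late
        }
    }
}
```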
either all link or all code?
LGTM
As per the Runner API design, this makes GroupByKey very explicitly a primitive, and moves the subsidiary primitives to top-level classes in the util/ (a.k.a. miscellaneous) directory, eventually to move to some appropriate final location for "runner utilities".