Skip to content

Conversation

@kennknowles
Copy link
Member

@kennknowles kennknowles commented May 3, 2016

Be sure to do all of the following to help us incorporate your contribution
quickly and easily:

  • Make sure the PR title is formatted like:
    [BEAM-<Jira issue #>] Description of pull request
  • Make sure tests pass via mvn clean verify. (Even better, enable
    Travis-CI on your fork and ensure the whole test matrix passes).
  • Replace <Jira issue #> in the title with the actual Jira issue
    number, if there is one.
  • If this contribution is large, please file an Apache
    Individual Contributor License Agreement.

These are some moderately useful preliminaries that came out of a quick attempt to try out the RunnableOnService tests with the Flink runner. The attempt got a bit stuck, but I figured these alteration might be useful either way.

If you are curious, translation seems to just not be invoked in my configuration, which you can see in the next couple commits on my branch flink-integration

This adds a Window.Bound translator that matches Flink streaming. It
depends on deprecated privileged methods of DoFn.ProcessContext. But
this is the status quo for most runners anyhow, and we have a plan
for migrating everything away from it.

This restores windowing support to the batch runner, and opens
the door to using GroupByKeyViaGroupByKeyOnly, for which Flink already
has the needed capabilities.
The current Flink batch translation ignores windows. This is consistent
with the requirements of GroupByKeyOnly. This change ports the runner
to the expansion of GroupByKey to a GroupByKeyOnly operation followed
by a GroupAlsoByWindow operation.
@kennknowles
Copy link
Member Author

R: @mxm @aljoscha

@kennknowles
Copy link
Member Author

The missing piece is state. Obvious routes forward:

  1. Add FlinkStateInternals to batch at the needed places.
  2. Have the runner bail out on non-global windows in batch, which would allow GroupAlsoByWindow to be a no-op.

@aljoscha
Copy link
Contributor

aljoscha commented May 4, 2016

I think adding FlinkStateInternals is the way to go, especially in the long run. We should do the windowing in a GroupReduceFunction which is used like this in Flink:

DataSet in = ...
in
  .groupBy(<some key>)
  .reduceGroup(<some GroupReduceFunction>)

elements are partitioned into key-groups by sorting, then the whole group of elements with the same key is passed to the GroupReduceFunction. We would only have to instantiate one FlinkStateInternals per group and can discard it afterwards since no more elements with the same key would arrive.

@aljoscha
Copy link
Contributor

aljoscha commented May 4, 2016

This is a somewhat bigger change, though.

@kennknowles kennknowles changed the title Port batch Flink GroupByKey to GroupByKeyViaGroupByKeyOnly [BEAM-115] Port batch Flink GroupByKey to GroupByKeyViaGroupByKeyOnly May 5, 2016
@kennknowles
Copy link
Member Author

Closing this in favor of nearer-term #291.

@kennknowles kennknowles closed this May 5, 2016
@kennknowles kennknowles deleted the flink-GBK branch November 10, 2016 03:10
iemejia pushed a commit to iemejia/beam that referenced this pull request Jan 12, 2018
mareksimunek pushed a commit to mareksimunek/beam that referenced this pull request May 9, 2018
[euphoria-examples] better exception handling in SimpleWordCount and …
pl04351820 pushed a commit to pl04351820/beam that referenced this pull request Dec 20, 2023
…#278)

Restore undocumented classes after async split.

Closes apache#277

Co-authored-by: Craig Labenz <craig.labenz@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants