@tomwhite
Member

This addresses https://issues.apache.org/jira/browse/BEAM-6.

I've preserved git history (using @mxm's amazing git rewriting trick from #12). This is just an initial import - the Spark runner build is not yet integrated with the main build, packages need changing, etc. That's going to take more work, so it might be a good idea to get this merged first.

Josh Wills and others added 30 commits March 10, 2016 11:14
more checkstyle improvements
… 4.12, Spark 1.2; Add source, javadoc plugins and other info; Fix javadoc errors and a few typos
The primary change needed to accommodate the new Dataflow API is to how we
handle side inputs.
tomwhite and others added 14 commits March 10, 2016 11:15
FileOutputFormats to be used with Spark Dataflow, as long as they
implement the ShardNameTemplateAware interface. This is easily achieved
by subclassing the desired FileOutputFormat class, see
TemplatedSequenceFileOutputFormat for an example.
Add support for application name and streaming (default: false)

Add pipeline options for streaming

Add print output as an unbounded write

Add default window strategy to represent Spark streaming micro-batches as fixed windows

This translator helps to translate Dataflow transformations into Spark (+streaming) transformations. This will help to support streaming transformations separately

Expose through the SparkPipelineTranslator

Now Evaluator uses SparkPipelineTranslator to translate

Add default application name

StreamingEvaluation context to support DStream evaluation. Expose members and methods in EvaluationContext
for inheritors

Use configured app name in options

A TransformTranslator for streaming

Add support for spark streaming execution in the runner

Fix comment

Create input stream from a queue - mainly for testing I guess

Add support to create input stream from queued values

Override method to expose in package

Test WordCount in streaming, just print out for now..

Stream print to console is a transformation of PCollection to PDone

rename to CreateStream to differ from Dataflow Create

It seems that in 1.3.1 short-lived streaming jobs (like unit tests) fail. This may be related to SPARK-7930, which is fixed in 1.4.0, so the version was bumped up.

Expose some methods, add a method to check if RDDHolder exists

make context final

Streaming default should be local[1] to support unit tests

No need for recurring context. Exposing additional parent methods. Added RUNNING state when stream is running.

WordCount test runs with a 1-second interval and compares to the expected output, as in batch.

Void Create triggers a no-input transformation

transformations and output operations can be applied on streams/bounded collections in the pipeline

foreachRDD is used for PDone transformation

Comments

SocketIO to consume stream from socket

Comment

Add support for Kafka input

Comments and some patching-up

Default is the same as in SparkPipelineOptions

Adding licenses

To satisfy license Javadoc and codestyle

Satisfy license Javadoc and codestyle

Check for DataflowAssertFailure because it won't propagate

Since DataflowAssert doesn't propagate failures in streaming, use Aggregators to assert

Use DataflowAssertStreaming

Add kafka translation

Embedded Kafka for unit test

Kafka unit test

import order

license

WindowingHelpers by Tom White @tomwhite

Combine @tomwhite windowing branch into mine - values are windowed values now

values are windowed values now

Input is UNBOUNDED now

Using windowing instead

batchInterval to be determined by pipeline runner

print the value not the windowed value

remove support for optimizations, for now.

batchInterval is determined by the pipeline runner now

Add streaming window pipeline visitor to determine windowing

Add windowing support in streaming unit tests

Combine.Globally is necessary so leave it

fix line length

renames

Add implementation for GroupAlsoByWindow which helps to solve broken grouped/combinePerKey

Line indentation

unused

codestyle

Expose runtimeContext

Make public

Use the smallest window found (fixed/sliding) as the batch duration

Make FieldGetter public

Add support for windowing

codestyle

unused

Update Spark to 1.5, kafka dependency should be provided

Abstract Evaluator for common evaluator code. doVisitTransform per implementation.

Added non-streaming windowing test by Tom White @tomwhite

Fixed Combine.GroupedValues and Combine.Globally to work with WindowedValues without losing window
properties. For now, Combine.PerKey is commented out until fixed to fully support WindowedValues.

Support WindowedValues, Global or not, in Combine.PerKey

After changes made to Combine.PerKey in 3a46150 it seems that the order has changed. Since order
didn't seem relevant before the change, I don't see a reason not to change the expected value
accordingly.

Update Spark version to 1.5.2
…further untangle some generics issues. Update plugins. Fix some minor code issues from inspection.
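
For readers skimming the commit list, the idea behind "Use the smallest window found (fixed/sliding) as the batch duration" can be illustrated with a small, self-contained sketch. This is not the runner's code: the class `BatchIntervalSketch`, the method `smallest`, and the fallback parameter are hypothetical names, and the commit does not say which duration of a sliding window is considered, so the sketch simply takes a collection of candidate durations.

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Hypothetical illustration only; names are assumptions, not the Spark runner's API.
final class BatchIntervalSketch {
    private BatchIntervalSketch() {}

    /**
     * Picks the smallest window duration found in a pipeline as the
     * micro-batch interval, falling back to a default when the pipeline
     * declares no windows at all.
     */
    static Duration smallest(Collection<Duration> windowDurations, Duration fallback) {
        if (windowDurations.isEmpty()) {
            return fallback;
        }
        // Duration is Comparable, so Collections.min returns the shortest one.
        return Collections.min(windowDurations);
    }

    public static void main(String[] args) {
        Duration batch = smallest(
            Arrays.asList(Duration.ofSeconds(10), Duration.ofSeconds(1), Duration.ofSeconds(5)),
            Duration.ofSeconds(2));
        System.out.println("batch interval: " + batch);
    }
}
```

Choosing the minimum means every declared window spans at least one micro-batch, which is presumably why the commit settles on it.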
@davorbonaci
Member

R: @davorbonaci

I'll take a quick peek.

@amitsela
Member

@davorbonaci feel free to merge this. I'll take care of integrating per https://issues.apache.org/jira/browse/BEAM-11

@davorbonaci
Member

Nice!

I'd probably get rid of LICENCE and CONTRIBUTING.md right away, and prefix the pull request with [BEAM-6] Import....

I can merge this right away -- no issues there. Just to confirm -- both of you should have commit/write access to the project. Is that not the case?

@amitsela
Member

Supposedly - but you're right, it'll be a good idea to test that.. Let me do the honors ;)

@asfgit asfgit merged commit 1229b00 into apache:master Mar 10, 2016
asfgit pushed a commit that referenced this pull request Mar 10, 2016
@davorbonaci
Member

Awesome. Thanks @amitsela and @tomwhite.

@mxm
Contributor

mxm commented Mar 11, 2016

Glad the snippet could help you out @tomwhite. Nice to see this going in!

echauchot added a commit to echauchot/beam that referenced this pull request May 12, 2017
cosmoskitten pushed a commit to cosmoskitten/beam that referenced this pull request Jun 16, 2017
Fix Runner categories in tests

Add streaming unit tests and corresponding labels
issue apache#37

Update numEvents: results are no longer linked to the number of events

issue apache#22
asfgit pushed a commit that referenced this pull request Aug 23, 2017
Fix Runner categories in tests

Add streaming unit tests and corresponding labels
issue #37

Update numEvents: results are no longer linked to the number of events

issue #22
lukecwik referenced this pull request in lukecwik/incubator-beam Mar 27, 2018
Use Read -> Impulse override utilities
tvalentyn pushed a commit to tvalentyn/beam that referenced this pull request May 15, 2018
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Apr 30, 2020
* Make check-links script more reliable

* Fix typos in links
hengfengli referenced this pull request in hengfengli/beam Mar 21, 2022
* feat: make finish partition action idempotent

* feat: make child partition action idempotent

* fix: fix insert query in partition metadata dao

* chore: spotless apply

* refactor: catch error on child partition action

Rely on already exists error to skip inserting a partition in the child
partition record instead of checking if the key exists. We reduce the
number of calls by doing this.

* refactor: catch error on finish partition action

Rely on catching an exception with a specific code to make the finish
partition action idempotent.

* refactor: removes unused dao method
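
The "rely on already exists error" approach described in the commit message above can be sketched as follows. This is a minimal illustration under stated assumptions, not the referenced change: `InMemoryPartitionStore`, `AlreadyExistsException`, and `insertIfAbsent` are hypothetical names, and an in-memory map stands in for the real metadata table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from the referenced commit.
final class IdempotentInsertSketch {
    private IdempotentInsertSketch() {}

    /** Stand-in for the storage layer's "row already exists" error. */
    static final class AlreadyExistsException extends RuntimeException {
        AlreadyExistsException(String key) {
            super("row already exists: " + key);
        }
    }

    /** In-memory stand-in for a partition metadata table. */
    static final class InMemoryPartitionStore {
        private final Map<String, String> rows = new HashMap<>();
        private int calls = 0;

        /** Inserts a row, failing if the key is already present. */
        void insert(String partitionToken, String state) {
            calls++;
            if (rows.putIfAbsent(partitionToken, state) != null) {
                throw new AlreadyExistsException(partitionToken);
            }
        }

        int size() { return rows.size(); }

        int calls() { return calls; }
    }

    /**
     * Idempotent insert: a single call to the store, with the "already exists"
     * error treated as success, instead of a read followed by a write.
     * Returns true if the row was created by this call.
     */
    static boolean insertIfAbsent(InMemoryPartitionStore store, String token, String state) {
        try {
            store.insert(token, state);
            return true;
        } catch (AlreadyExistsException e) {
            // Another attempt already created the row; nothing more to do.
            return false;
        }
    }
}
```

Compared with checking for the key first, this issues one store call per attempt rather than two, which matches the commit's stated goal of reducing the number of calls.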
8 participants