Import Spark Runner code #37
Conversation
more checkstyle improvements
…them explicitly
… 4.12, Spark 1.2; Add source, javadoc plugins and other info; Fix javadoc errors and a few typos
… outside contributions.
The primary change needed to accommodate the new Dataflow API is in how we handle side inputs.
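For context, this is the side input pattern the runner now has to translate. A minimal sketch, assuming the Dataflow SDK 1.x API (`View.asSingleton()`, `ParDo.withSideInputs(...)`); the method and collection names (`filterByThreshold`, `thresholds`) are illustrative only:

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.transforms.View;
import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionView;

public class SideInputSketch {
  // Illustrative: keep only elements of 'input' at or above a singleton threshold.
  static PCollection<Integer> filterByThreshold(
      PCollection<Integer> input, PCollection<Integer> thresholds) {
    final PCollectionView<Integer> threshold =
        thresholds.apply(View.<Integer>asSingleton());
    return input.apply(
        ParDo.withSideInputs(threshold).of(new DoFn<Integer, Integer>() {
          @Override
          public void processElement(ProcessContext c) {
            // The side input is read through the ProcessContext; making this
            // view available at execution time is what the runner must handle.
            if (c.element() >= c.sideInput(threshold)) {
              c.output(c.element());
            }
          }
        }));
  }
}
```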
Custom FileOutputFormats can be used with Spark Dataflow, as long as they implement the ShardNameTemplateAware interface. This is easily achieved by subclassing the desired FileOutputFormat class; see TemplatedSequenceFileOutputFormat for an example.
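A hedged sketch of that subclassing pattern (the real TemplatedSequenceFileOutputFormat in this PR may override more than this; ShardNameTemplateAware is assumed here to be a marker interface, and its import from the runner's package is omitted):

```java
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Illustrative subclass: implementing ShardNameTemplateAware signals to
// Spark Dataflow that this output format cooperates with the pipeline's
// shard name template when output files are named. (Marker interface is
// an assumption; see TemplatedSequenceFileOutputFormat for the real code.)
public class MyTemplatedSequenceFileOutputFormat<K, V>
    extends SequenceFileOutputFormat<K, V> implements ShardNameTemplateAware {
}
```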
Support was added in Spark 1.5.0 for user exception propagation; see https://issues.apache.org/jira/browse/SPARK-8625. Fixes https://github.com/cloudera/spark-dataflow/issues/69
* Add support for application name and streaming (default: false)
* Add pipeline options for streaming
* Add print output as an unbounded write
* Add default window strategy to represent Spark streaming micro-batches as fixed windows
* This translator helps to translate Dataflow transformations into Spark (+streaming) transformations. This will help to support streaming transformations separately
* Expose through the SparkPipelineTranslator
* Now Evaluator uses SparkPipelineTranslator to translate
* Add default application name
* StreamingEvaluationContext to support DStream evaluation
* Expose members and methods in EvaluationContext for inheritors
* Use configured app name in options
* A TransformTranslator for streaming
* Add support for Spark streaming execution in the runner
* Fix comment
* Create input stream from a queue - mainly for testing, I guess (see the sketch after this list)
* Add support to create input stream from queued values
* Override method to expose in package
* Test WordCount in streaming, just print out for now
* Stream print to console is a transformation of PCollection to PDone
* Rename to CreateStream to differ from Dataflow's Create
* It seems that in 1.3.1 short-lived streaming jobs (like unit tests) fail. Maybe has something to do with SPARK-7930; fixed in 1.4.0, so bumped up
* Expose some methods, add a method to check if RDDHolder exists
* Make context final
* Streaming default should be local[1] to support unit tests
* No need for recurring context
* Exposing additional parent methods
* Added RUNNING state when stream is running
* WordCount test runs at a 1-second interval and compares to expected output, like in batch
* Void Create triggers a no-input transformation
* Transformations and output operations can be applied on streams/bounded collections in the pipeline
* foreachRDD is used for the PDone transformation
* Comments
* SocketIO to consume a stream from a socket
* Comment
* Add support for Kafka input
* Comments and some patching up
* Default is the same as in SparkPipelineOptions
* Adding licenses
* To satisfy license
* Javadoc and codestyle
* Satisfy license
* Javadoc and codestyle
* Check for DataflowAssertFailure because it won't propagate
* Since DataflowAssert doesn't propagate failures in streaming, use Aggregators to assert
* Use DataflowAssertStreaming
* Add Kafka translation
* Embedded Kafka for unit test
* Kafka unit test
* Import order
* License
* WindowingHelpers by Tom White @tomwhite
* Combine @tomwhite's windowing branch into mine - values are windowed values now
* Values are windowed values now
* Input is UNBOUNDED now
* Use windowing instead; batchInterval to be determined by the pipeline runner
* Print the value, not the windowed value
* Remove support for optimizations, for now
* batchInterval is determined by the pipeline runner now
* Add streaming window pipeline visitor to determine windowing
* Add windowing support in streaming unit tests
* Combine.Globally is necessary, so leave it
* Fix line length
* Renames
* Add implementation for GroupAlsoByWindow, which helps to solve broken grouped/combinePerKey
* Line indentation
* Unused
* Codestyle
* Expose runtimeContext
* Make public
* Use the smallest window found (fixed/sliding) as the batch duration
* Make FieldGetter public
* Add support for windowing
* Codestyle
* Unused
* Update Spark to 1.5; Kafka dependency should be provided
* Abstract Evaluator for common evaluator code; doVisitTransform per implementation
* Added non-streaming windowing test by Tom White @tomwhite
* Fixed Combine.GroupedValues and Combine.Globally to work with WindowedValues without losing window properties. For now, Combine.PerKey is commented out until fixed to fully support WindowedValues
* Support WindowedValues, Global or not, in Combine.PerKey
* After changes made to Combine.PerKey in 3a46150 it seems the order has changed. Since order didn't seem relevant before the change, I don't see a reason not to change the expected value accordingly
* Update Spark version to 1.5.2
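Several of the commits above build streaming test support on queue-backed input streams. Underneath, the natural Spark Streaming primitive for this is queueStream; here is a minimal, self-contained sketch of that mechanism (plain Spark Streaming, not the CreateStream transform added in this PR, whose exact API isn't shown here):

```java
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class QueueStreamSketch {
  public static void main(String[] args) throws InterruptedException {
    // local[1] matches the streaming default mentioned above for unit tests.
    JavaStreamingContext jssc =
        new JavaStreamingContext("local[1]", "queue-stream-test", new Duration(1000));

    // Each RDD pushed onto the queue is served as one micro-batch.
    Queue<JavaRDD<String>> queue = new LinkedList<>();
    queue.add(jssc.sparkContext().parallelize(Arrays.asList("hello", "streaming")));

    JavaDStream<String> stream = jssc.queueStream(queue);
    // print() is an output operation, analogous to the print-as-unbounded-write above.
    stream.print();

    jssc.start();
    jssc.awaitTerminationOrTimeout(3000); // run a few 1s batch intervals, then stop
    jssc.stop();
  }
}
```

The sketch hard-codes a one-second batch interval; per the commits above, the runner instead derives it from the pipeline's windowing (the smallest fixed or sliding window found), so micro-batches line up with fixed windows.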
Wrong package utils
…further untangle some generics issues. Update plugins. Fix some minor code issues from inspection.
R: @davorbonaci

I'll take a quick peek.
@davorbonaci feel free to merge this. I'll take care of integrating per https://issues.apache.org/jira/browse/BEAM-11
Nice! I'd probably get rid of LICENCE and CONTRIBUTING.md right away, and prefix the pull request with […]

I can merge this right away -- no issues there. Just to confirm -- both of you should have commit/write access to the project. Is that not the case?
Supposedly - but you're right, it'll be a good idea to test that. Let me do the honors ;)
Glad the snippet could help you out @tomwhite. Nice to see this going in!
This addresses https://issues.apache.org/jira/browse/BEAM-6.
I've preserved git history (using @mxm's amazing git rewriting trick from #12). This is just an initial import - the Spark runner build is not yet integrated with the main build, packages need changing, etc. That's going to take more work, so it might be a good idea to get this merged first.