Thunderain is a Real-Time Analytical Processing (RTAP) example built on Spark and Shark. RTAP is best characterized by the following four properties:
- Data continuously streamed in & processed in near real-time
- Real-time data queried and presented in an online fashion
- Real-time and historical data combined and mined interactively
- Predominantly RAM-based processing
For more details, please refer to our presentation at the AMPLab retreat in May 2013.
The Thunderain example provides a basic RTAP framework that
- Allows multiple applications (Apps) to be defined, each of which is bound to a Kafka topic
- Fetches data streamed in from the Kafka message queue
- Parses the data stream and then processes the parsed data for counting & aggregation (similar to RainBird) using Spark Streaming
- Outputs the processed results to a cached table, which can then be queried through Shark
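
The flow above can be modelled in a few lines of plain Python, without Kafka or Spark Streaming. This is only an illustrative sketch, not Thunderain code: the `timestamp,page,bytes` message format, the `parse_event` and `process_batch` names, and the in-memory `cached_table` dictionary are all assumptions made for the example.

```python
from collections import defaultdict

def parse_event(line):
    """Parse one raw message of the assumed form 'timestamp,page,bytes'."""
    timestamp, page, size = line.strip().split(",")
    return {"timestamp": int(timestamp), "page": page, "bytes": int(size)}

def process_batch(lines, table):
    """Count events and sum bytes per page for one micro-batch,
    merging the results into the running totals held in `table`."""
    for event in map(parse_event, lines):
        row = table[event["page"]]
        row["count"] += 1
        row["bytes"] += event["bytes"]
    return table

# Hypothetical stand-in for the cached table that would be queried through Shark.
cached_table = defaultdict(lambda: {"count": 0, "bytes": 0})

# Two micro-batches standing in for messages fetched from a Kafka topic.
batches = [
    ["1000,/home,512", "1001,/cart,128"],
    ["1002,/home,256"],
]
for batch in batches:
    process_batch(batch, cached_table)

print(dict(cached_table))
```

Because each batch only updates running totals, a query against the table in this sketch sees fresh results as soon as a batch has been processed.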
To define an App, the user needs to specify
- The parser (implementing `AbstractEventParser`) to parse the data stream; several parsers (e.g., `ClickEventParser` and `WebLogParser`) are provided in the example
- One or more jobs, each of which
  - Performs an operation (implementing both `AbstractOperator` and `OperatorConfig`) on the streaming data; several operators (e.g., `CountOperator`, `AggregateOperator` and `DistinctAggregateCountOperator`) are provided in the example
  - Writes the processed results using an outputer (implementing `AbstractEventOutput`); several outputers (e.g., `StdEventOutput`, `TableRDDOutput` and `TachyonRDDOutput`) are provided in the example
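
The relationship between an App, its parser, and its jobs can be sketched as follows. This is a plain-Python model of the structure described above, not the Thunderain interfaces themselves: every class and method name here (`EventParser`, `Operator`, `EventOutput`, `Job`, `App`, `on_batch`, and the concrete subclasses) is hypothetical, as is the `user,url` message format.

```python
from abc import ABC, abstractmethod
from collections import Counter

class EventParser(ABC):
    """Hypothetical parser role: turns one raw message into an event."""
    @abstractmethod
    def parse(self, raw): ...

class Operator(ABC):
    """Hypothetical operator role: computes a result from a batch of events."""
    @abstractmethod
    def process(self, events): ...

class EventOutput(ABC):
    """Hypothetical output role: stores or displays a result."""
    @abstractmethod
    def write(self, result): ...

class CsvClickParser(EventParser):
    def parse(self, raw):
        # Assumed message format: 'user,url'
        user, url = raw.split(",")
        return {"user": user, "url": url}

class CountByKey(Operator):
    def __init__(self, key):
        self.key = key

    def process(self, events):
        return dict(Counter(event[self.key] for event in events))

class ListOutput(EventOutput):
    def __init__(self):
        self.rows = []

    def write(self, result):
        self.rows.append(result)

class Job:
    """One job pairs an operator with the output that receives its result."""
    def __init__(self, operator, output):
        self.operator = operator
        self.output = output

    def run(self, events):
        self.output.write(self.operator.process(events))

class App:
    """One App is bound to a topic and owns a parser plus one or more jobs."""
    def __init__(self, topic, parser, jobs):
        self.topic = topic
        self.parser = parser
        self.jobs = jobs

    def on_batch(self, raw_messages):
        # Parse once, then hand the same events to every job.
        events = [self.parser.parse(message) for message in raw_messages]
        for job in self.jobs:
            job.run(events)

by_url, by_user = ListOutput(), ListOutput()
app = App(
    topic="clickstream",  # assumed topic name
    parser=CsvClickParser(),
    jobs=[Job(CountByKey("url"), by_url), Job(CountByKey("user"), by_user)],
)
app.on_batch(["alice,/home", "bob,/home", "alice,/cart"])
print(by_url.rows, by_user.rows)
```

In this sketch the parser runs once per batch and each job sees the same parsed events, so adding a job means adding another operator and output pair without touching the parser.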
For more details, please refer to the wiki.
The Thunderain example provides two RTAP applications (i.e., clickstream and weblog), as defined in conf/properties.xml. They have been tested on our internal Spark/Shark deployment (the Spark and Shark versions used are available at https://github.com/thunderain-project/spark and https://github.com/thunderain-project/shark, respectively). To run the applications, one needs to
- Build the project with `sbt package`
- Configure related properties (e.g., log4j and the Spark fairScheduler) in the `conf/` directory
- Launch the framework with `run thunderainproject.thunderain.framework.Thunderain <config file list>`
The Thunderain example is open-sourced under the Apache License, Version 2.0.