Sample project that processes clickstream data using Kafka and Apache Spark.
Install Scala, Kafka, and Apache Spark with Homebrew (`brew install scala kafka apache-spark`), then set up your environment:
```shell
export JAVA_HOME="$(/usr/libexec/java_home)"
export PATH=$JAVA_HOME/bin:$PATH
export SCALA_HOME="/usr/local/Cellar/scala/2.12.4" # find the exact path with `brew info scala`
export PATH=$SCALA_HOME/bin:$PATH
```

Make sure that `/usr/local/bin` is also added to your `$PATH`.
Use pyenv or similar to manage your Python versions and virtual environments. After creating a virtual environment, install the dependencies with `pip install -r requirements.txt`.
To use production data, copy the CSV file to `data/production.csv`.
See the make commands in the `Makefile` for running the services locally.
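For orientation, the service targets might look roughly like this — a sketch only, assuming a Homebrew install with Kafka's config under `/usr/local/etc/kafka`; the project's actual `Makefile`, paths, and ports may differ:

```makefile
# Sketch only -- paths assume a Homebrew Kafka install; adjust to your machine.
zookeeper:
	zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties

kafka:
	kafka-server-start /usr/local/etc/kafka/server.properties

create_topic:
	kafka-topics --create --topic clickstream \
	  --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1
```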
- Start Zookeeper: `make zookeeper`
- Start Kafka: `make kafka`
- In a new tab, create the `clickstream` topic with `make create_topic` (unless it already exists).
- Start the simple Spark stream that monitors the `clickstream` topic and prints the messages to the command line: `make spark_read`
- In a new tab, stream some sample data to Kafka: `make sample_data`
- The sample data should appear in the simple stream in the previous tab.
- Make sure that your production data (a really big CSV) is found under `data/production.csv`.
- Start importing production data with `make production_data`.
- Start the categories stream with `make spark_categories`.
- The categories should appear counted, with a sliding interval of 10 seconds.
- The output of the previous stream should also be written to the file system in the `output` directory.
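The data-streaming steps above push CSV rows into Kafka; a minimal producer along these lines could do it. This is a sketch, not the project's actual code — the `kafka-python` package, the JSON message format, and the topic name are assumptions:

```python
import csv
import json


def row_to_message(row):
    """Serialize one CSV row (a dict) into a JSON-encoded Kafka message value."""
    return json.dumps(row, sort_keys=True).encode("utf-8")


def stream_csv(path, send):
    """Read a CSV file and hand each serialized row to `send`."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            send(row_to_message(row))


if __name__ == "__main__":
    # Requires a running broker and `pip install kafka-python`.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    stream_csv("data/production.csv", lambda m: producer.send("clickstream", m))
    producer.flush()
```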
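To see what the categories stream computes, here is a pure-Python illustration of a sliding-window count. The real job uses Spark's windowing; the `(timestamp, category)` event shape and the 30-second window length are assumptions for the example — only the 10-second slide comes from the steps above:

```python
from collections import Counter


def sliding_counts(events, window=30, slide=10):
    """Count categories per sliding window.

    events: list of (timestamp_seconds, category) tuples.
    Returns {window_start: Counter} for each slide step that saw events.
    Illustrative only; the real stream does this with Spark.
    """
    if not events:
        return {}
    t_max = max(t for t, _ in events)
    results = {}
    start = 0
    while start <= t_max:
        end = start + window
        counts = Counter(c for t, c in events if start <= t < end)
        if counts:
            results[start] = counts
        start += slide
    return results


events = [(1, "news"), (5, "sports"), (12, "news"), (25, "news"), (41, "sports")]
print(sliding_counts(events))
```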
