Project for Data Engineering Fellowship at Insight Data Science '15A
Questions and comments welcome at gy8 [AT] berkeley [DOT] edu
Stargazer is a data pipeline that serves up aggregated map statistics computed from a massive amount of player-submitted replay files, leveraging Kafka for ingestion, Avro for serialization, Spark Streaming for stream processing, Spark SQL for batch processing, Cassandra for data storage, and Flask for the front-end API.
Prerequisites:
- set up a Hadoop cluster where you will run ingestion and processing (Kafka, HDFS, Spark), with the required dependencies installed (pretty much everything except Cassandra)
- set up a Cassandra cluster (it does not need the dependencies you installed for the Hadoop cluster)
- clone this repo on the master node of your Hadoop cluster
Ingestion:
- ssh into the master node of your Hadoop cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- double-check that ZooKeeper is running
```
$ cd /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin
$ sudo ./zookeeper-server status
```
- start up the Kafka server (run this in the background via tmux, as it will otherwise lock your terminal)
```
$ ./stargazer/codes/ingestion/start_kafka_server.sh
```
- start running the Kafka producer (a rough sketch of what this script might do is shown after the command)
```
$ python stargazer/codes/ingestion/produce_recent_matches.py
```
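produce_recent_matches.py is not reproduced here, but a minimal sketch of what such a producer might look like follows, using requests and kafka-python. The endpoint URL, topic name, and message format are assumptions for illustration, not the project's actual values:
```
import json
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

# hypothetical endpoint and topic name -- substitute the project's actual values
RECENT_MATCHES_URL = "http://ggtracker.com/api/v1/matches/recent"
TOPIC = "recent_matches"

producer = KafkaProducer(bootstrap_servers="localhost:9092")

while True:
    # each call returns the 10 most recently submitted matches
    matches = requests.get(RECENT_MATCHES_URL).json()
    for match in matches:
        producer.send(TOPIC, json.dumps(match).encode("utf-8"))
    producer.flush()
    time.sleep(5)  # the API is polled rather than streamed
```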
Stream processing:
- ssh into the master node of your Hadoop cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- change directory into StreamProcessing
```
$ cd stargazer/codes/streaming/StreamProcessing
```
- package dependencies:
```
$ sbt assembly
```
- do a little compilin':
```
$ sbt package
```
- spark-submit and hope you do not see 3 pages of error messages come flowing out:
```
$ ./run_streamprocessing.sh
```
Batch processing:
- ssh into the master node of your Hadoop cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- package dependencies, compile and spark-submit
```
$ sbt assembly
$ sbt package
$ ./run_sparkcassie.sh
```
Querying results:
- ssh into the seed node (node0 by default) of your Cassandra cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- fire up the Cassandra CQL shell
```
$ cqlsh
```
- create or view keyspaces and tables using CQL commands
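If you would rather script the queries than type them into cqlsh, here is a minimal sketch using the DataStax Python driver; the keyspace and table names are placeholders, not necessarily the project's actual schema:
```
from cassandra.cluster import Cluster  # pip install cassandra-driver

# contact point is the seed node's address
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# placeholder keyspace/table names -- substitute the project's actual schema
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS stargazer
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
session.set_keyspace("stargazer")

for row in session.execute("SELECT * FROM map_counts LIMIT 10"):
    print(row)
```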
Distribution:
- CDH (Cloudera Distribution Including Apache Hadoop): 5.3.0
Technologies:
- Spark (Spark SQL and Spark Streaming): 1.2.0
- Hadoop: 2.5.0
- Cassandra: 2.1.2
- Avro: 1.7.7
Third-party Libraries:
- spark-cassandra-connector: 1.2.0
- spark-avro: 0.1
Starcraft II is a real-time strategy game released by Blizzard Entertainment in 2010. In addition to having a large user base of 300,000 active players worldwide, Starcraft II has an exciting competitive scene: just this last year, the finals of one of the top tournaments, the World Championship Series, were broadcast live on ESPN (not to mention the $1.6 million total prize pool).
Gameplay centers on 1v1 matches, where each player builds bases to gather resources in order to produce armies and eliminate his/her opponent. Players compete on the ladder system for higher ranks, and many professional gamers (most of whom play for sponsored teams) were discovered through very high ladder rankings.
Maps are an integral part of the Starcraft II gaming experience. All ranked ladder games are played on a set of maps that changes each season, and many of these ladder maps are used in international tournaments.
Designing a great map is inherently difficult: on the one hand, it has to offer complexity, with features that give an advantage to skilled players who really understand the game dynamics; on the other hand, it needs to be balanced across different play styles, such as:
- Races (every player can choose to be Terran, Protoss, or Zerg, each race with its own unique units and strategies)
- Strategies (some players opt for an early rush, while some aim to outlast their opponent in long, drawn-out games)
It is especially difficult to measure how balanced a map is: rounds and rounds of beta testing simply do not capture the whole picture. Currently, selecting which maps are kept for the new season is largely a qualitative process, where users vote on the maps they like. This naturally led to the question I attempt to address with this project: what does game data reveal about map balance?
All the data (JSON) come from API calls made against ggtracker.com. Specifically, there are three types of data we can extract from each match submitted by GGtracker users:
- meta-data about the match (match id, map name, winner, when the match ended, match duration, etc.)
- snapshots of the game state taken every 10 seconds (current mineral/gas count, supply count, etc.)
- game events triggered by players at different times (upgrades, scouting, aggression, etc.)
For historical data, the above information is stored in:
- simple details (meta-data)
- extended details (10 second snapshots and events triggered)
There are currently around 5.8 million matches available from the GGtracker API. At around 100 KB per match detail file, that comes to roughly 580 GB of data in total. Due to API call restrictions, I currently have only a small fraction of the total data and am continuously pulling more.
As for real-time data, the API does not offer true streaming; instead, each call returns a list of the 10 most recently submitted matches.
Here is an overview of the technologies used in my data pipeline:
Notice that this is not an implementation of Nathan Marz's lambda architecture: while the computations from streaming and from batch are both stored in Cassandra, they are completely different metrics and are merged neither in the data store nor in the front end.
For ingestion, raw JSON data are mined from the GGtracker API using multiple EC2 nodes, each with its own unique IP. The data are serialized in batches via Avro and pushed onto Kafka; a sketch of the serialization step follows.
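A minimal version of that step, using the Python avro library, might look like this; the schema is a made-up subset of the match meta-data, not the project's actual schema:
```
import io

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# a made-up subset of the match meta-data; the real schema has more fields
SCHEMA = avro.schema.parse("""
{"type": "record", "name": "Match",
 "fields": [{"name": "match_id", "type": "long"},
            {"name": "map_name", "type": "string"},
            {"name": "winner",   "type": "string"}]}
""")

def serialize_batch(matches):
    """Pack a batch of match dicts into one Avro container (returned as bytes)."""
    buf = io.BytesIO()
    writer = DataFileWriter(buf, DatumWriter(), SCHEMA)
    for match in matches:
        writer.append(match)
    writer.flush()
    return buf.getvalue()
```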
For batch processing, once the serialized Avro files are on HDFS, Spark SQL is used to read them in directly as SchemaRDDs. Currently the only computations being done are 1) aggregating matches by the day on which they were played and 2) ranking the top 10 most popular maps. There are other, more interesting computations that could be done (tracking win rate by race/map, filtering by league/patch), but they are omitted in the interest of time.
After these computations, the SchemaRDDs are converted into mapped RDDs and saved directly into Cassandra. A sketch of the batch logic is shown below.
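The actual batch job is a Scala Spark application (see the build steps above); purely as an illustration of the two computations, here is the equivalent logic sketched in 1.2-era pyspark. The HDFS path and field names are guesses, and the raw JSON is read directly instead of Avro to keep the sketch self-contained:
```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="StargazerBatchSketch")
sqlContext = SQLContext(sc)

# hypothetical HDFS path; the real job reads the Avro files via spark-avro
matches = sqlContext.jsonFile("hdfs:///stargazer/matches/")
matches.registerTempTable("matches")

# 1) aggregate matches by the day on which they were played
#    ('ended_at' is assumed to be an ISO timestamp string)
per_day = matches.map(lambda r: (r.ended_at[:10], 1)) \
                 .reduceByKey(lambda a, b: a + b)

# 2) rank the top 10 most popular maps
top_maps = sqlContext.sql(
    "SELECT map_name, COUNT(*) AS plays FROM matches "
    "GROUP BY map_name ORDER BY plays DESC LIMIT 10")

# the Scala job converts rows like these into mapped RDDs and saves
# them to Cassandra via the spark-cassandra-connector
for row in top_maps.collect():
    print(row.map_name, row.plays)
```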
For stream processing, an API call is made every 5 seconds to get the list of the most recent matches played. This is fed directly into Kafka (without writing to disk), and Spark Streaming pulls it off to extract the maps these recent matches were played on. The results are then saved directly into Cassandra. A sketch of the streaming logic is shown below.
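Again, the actual streaming job is written in Scala; purely as a sketch of the same logic in pyspark (note that pyspark only gained Kafka support in Spark 1.3, and the ZooKeeper quorum, consumer group, topic, and field names here are assumptions):
```
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="StargazerStreamSketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# hypothetical ZooKeeper quorum, consumer group, and topic
stream = KafkaUtils.createStream(
    ssc, "localhost:2181", "stargazer", {"recent_matches": 1})

# messages arrive as (key, value) pairs; pull out each match's map name
map_counts = stream.map(lambda kv: json.loads(kv[1])["map_name"]) \
                   .countByValue()

# the real job saves these counts to Cassandra through the
# spark-cassandra-connector; here each batch is just printed
map_counts.pprint()

ssc.start()
ssc.awaitTermination()
```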

