Project for Data Engineering Fellowship at Insight Data Science '15A
Questions and comments welcome at gy8 [AT] berkeley [DOT] edu
Stargazer is a data pipeline that serves up aggregated map statistics computed from a massive amount of player-submitted replay files, leveraging Kafka for ingestion, Avro for serialization, Spark Streaming for stream processing, Spark SQL for batch processing, Cassandra for data storage, and Flask for the front-end API.
Prerequisites:
- set up a Hadoop cluster where you will run ingestion and processing (Kafka, HDFS, Spark), with the required dependencies installed (pretty much everything except Cassandra)
- set up a Cassandra cluster (it does not need the dependencies you installed for the Hadoop cluster)
- clone this repo on the master node of your Hadoop cluster
Ingestion:
- ssh into the master node of your Hadoop cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- double-check that ZooKeeper is running
```
$ cd /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/bin
$ sudo ./zookeeper-server status
```
- start up the Kafka server (run this in the background via tmux, as it will otherwise lock your terminal)
```
$ ./stargazer/codes/ingestion/start_kafka_server.sh
```
- start running the Kafka producer (a rough sketch of what this script might do is shown after the command)
```
$ python stargazer/codes/ingestion/produce_recent_matches.py
```
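produce_recent_matches.py is not reproduced here, but a minimal sketch of what such a producer might look like follows, using requests and kafka-python. The endpoint URL, topic name, and message format are assumptions for illustration, not the project's actual values:
```
import json
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

# hypothetical endpoint and topic name -- substitute the project's actual values
RECENT_MATCHES_URL = "http://ggtracker.com/api/v1/matches/recent"
TOPIC = "recent_matches"

producer = KafkaProducer(bootstrap_servers="localhost:9092")

while True:
    # each call returns the 10 most recently submitted matches
    matches = requests.get(RECENT_MATCHES_URL).json()
    for match in matches:
        producer.send(TOPIC, json.dumps(match).encode("utf-8"))
    producer.flush()
    time.sleep(5)  # the API is polled rather than streamed
```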
Stream processing:
- ssh into the master node of your Hadoop cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- change directory into StreamProcessing
```
$ cd stargazer/codes/streaming/StreamProcessing
```
- package dependencies:
```
$ sbt assembly
```
- do a little compilin':
```
$ sbt package
```
- spark-submit and hope you do not see 3 pages of error messages come flowing out:
```
$ ./run_streamprocessing.sh
```
Batch processing:
- ssh into the master node of your Hadoop cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- package dependencies, compile and spark-submit
```
$ sbt assembly
$ sbt package
$ ./run_sparkcassie.sh
```
Querying results:
- ssh into the seed node (node0 by default) of your Cassandra cluster
```
$ ssh -i ~/.ssh/YO_PEM_KEY.pem ubuntu@ec2-XX-XX-XX-X.us-west-1.compute.amazonaws.com
```
- fire up the Cassandra CQL shell
```
$ cqlsh
```
- create or view keyspaces and tables using CQL commands
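If you would rather script the queries than type them into cqlsh, here is a minimal sketch using the DataStax Python driver; the keyspace and table names are placeholders, not necessarily the project's actual schema:
```
from cassandra.cluster import Cluster  # pip install cassandra-driver

# contact point is the seed node's address
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

# placeholder keyspace/table names -- substitute the project's actual schema
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS stargazer
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
session.set_keyspace("stargazer")

for row in session.execute("SELECT * FROM map_counts LIMIT 10"):
    print(row)
```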
Distribution:
- CDH (Cloudera Distribution Including Apache Hadoop): 5.3.0
Technologies:
- Spark (Spark SQL and Spark Streaming): 1.2.0
- Hadoop: 2.5.0
- Cassandra: 2.1.2
- Avro: 1.7.7
Third-party Libraries:
- spark-cassandra-connector: 1.2.0
- spark-avro: 0.1
Starcraft II is a real-time strategy game released by Blizzard Entertainment in 2010. In addition to having a large user base of 300,000 active players worldwide, Starcraft II has an exciting competitive scene: just this last year, the finals of one of the top tournaments, the World Championship Series, were broadcast live on ESPN (not to mention the $1.6 million total prize pool).
Gameplay centers on 1v1 matches, where each player builds bases to gather resources in order to produce armies and eliminate his/her opponent. Players compete on the ladder system for higher ranks, and many professional gamers (most of whom play for sponsored teams) were discovered through very high ladder rankings.
Maps are an integral part of the Starcraft II gaming experience. All ranked ladder games are played on a set of maps that changes each season, and many of these ladder maps are used in international tournaments.
Designing a great map is inherently difficult: on the one hand, it has to offer complexity, with features that give an advantage to skilled players who really understand the game dynamics; on the other hand, it needs to be balanced across different play styles, such as:
- Races (every player can choose to be Terran, Protoss, or Zerg, each race with its own unique units and strategies)
- Strategies (some players opt for an early rush, while some aim to outlast their opponent in long, drawn-out games)
It is especially difficult to measure how balanced a map is: rounds and rounds of beta testing simply do not capture the whole picture. Currently, selecting which maps are kept for the new season is largely a qualitative process, where users vote on the maps they like. This naturally led to the question I attempt to address with this project: what does game data reveal about map balance?
All the data (JSON) come from API calls made against ggtracker.com. Specifically, there are three types of data we can extract from each match submitted by GGtracker users:
- meta-data about the match (match id, map name, winner, when the match ended, match duration, etc.)
- snapshots of the game state taken every 10 seconds (current mineral/gas count, supply count, etc.)
- game events triggered by players at different times (upgrades, scouting, aggression, etc.)
For historical data, the above information is stored in:
- simple details (meta-data)
- extended details (10 second snapshots and events triggered)
There are currently around 5.8 million matches available from the GGtracker API. At around 100 KB per match detail file, that comes to roughly 580 GB of data in total. Due to API call restrictions, I currently have only a small fraction of the total data and am continuously pulling more.
As for real-time data, the API does not offer true streaming; instead, each call returns a list of the 10 most recently submitted matches.
Here is an overview of the technologies used in my data pipeline:
Notice that this is not an implementation of Nathan Marz's lambda architecture: while the computations from streaming and from batch are both stored in Cassandra, they are completely different metrics and are merged neither in the data store nor in the front end.
For ingestion, raw JSON data are mined from the GGtracker API using multiple EC2 nodes, each with its own unique IP. The data are serialized in batches via Avro and pushed onto Kafka; a sketch of the serialization step follows.
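A minimal version of that step, using the Python avro library, might look like this; the schema is a made-up subset of the match meta-data, not the project's actual schema:
```
import io

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# a made-up subset of the match meta-data; the real schema has more fields
SCHEMA = avro.schema.parse("""
{"type": "record", "name": "Match",
 "fields": [{"name": "match_id", "type": "long"},
            {"name": "map_name", "type": "string"},
            {"name": "winner",   "type": "string"}]}
""")

def serialize_batch(matches):
    """Pack a batch of match dicts into one Avro container (returned as bytes)."""
    buf = io.BytesIO()
    writer = DataFileWriter(buf, DatumWriter(), SCHEMA)
    for match in matches:
        writer.append(match)
    writer.flush()
    return buf.getvalue()
```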
For batch processing, once the serialized Avro files are on HDFS, Spark SQL is used to read them in directly as SchemaRDDs. Currently the only computations being done are 1) aggregating matches by the day on which they were played and 2) ranking the top 10 most popular maps. There are other, more interesting computations that could be done (tracking win rate by race/map, filtering by league/patch), but they are omitted in the interest of time.
After these computations, the SchemaRDDs are converted into mapped RDDs and saved directly into Cassandra. A sketch of the batch logic is shown below.
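The actual batch job is a Scala Spark application (see the build steps above); purely as an illustration of the two computations, here is the equivalent logic sketched in 1.2-era pyspark. The HDFS path and field names are guesses, and the raw JSON is read directly instead of Avro to keep the sketch self-contained:
```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="StargazerBatchSketch")
sqlContext = SQLContext(sc)

# hypothetical HDFS path; the real job reads the Avro files via spark-avro
matches = sqlContext.jsonFile("hdfs:///stargazer/matches/")
matches.registerTempTable("matches")

# 1) aggregate matches by the day on which they were played
#    ('ended_at' is assumed to be an ISO timestamp string)
per_day = matches.map(lambda r: (r.ended_at[:10], 1)) \
                 .reduceByKey(lambda a, b: a + b)

# 2) rank the top 10 most popular maps
top_maps = sqlContext.sql(
    "SELECT map_name, COUNT(*) AS plays FROM matches "
    "GROUP BY map_name ORDER BY plays DESC LIMIT 10")

# the Scala job converts rows like these into mapped RDDs and saves
# them to Cassandra via the spark-cassandra-connector
for row in top_maps.collect():
    print(row.map_name, row.plays)
```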
For stream processing, an API call is made every 5 seconds to get the list of the most recent matches played. This is fed directly into Kafka (without writing to disk), and Spark Streaming pulls it off to extract the maps these recent matches were played on. The results are then saved directly into Cassandra. A sketch of the streaming logic is shown below.
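Again, the actual streaming job is written in Scala; purely as a sketch of the same logic in pyspark (note that pyspark only gained Kafka support in Spark 1.3, and the ZooKeeper quorum, consumer group, topic, and field names here are assumptions):
```
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="StargazerStreamSketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# hypothetical ZooKeeper quorum, consumer group, and topic
stream = KafkaUtils.createStream(
    ssc, "localhost:2181", "stargazer", {"recent_matches": 1})

# messages arrive as (key, value) pairs; pull out each match's map name
map_counts = stream.map(lambda kv: json.loads(kv[1])["map_name"]) \
                   .countByValue()

# the real job saves these counts to Cassandra through the
# spark-cassandra-connector; here each batch is just printed
map_counts.pprint()

ssc.start()
ssc.awaitTermination()
```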

