We have just launched our brand new superheroes game and have been collecting user stats for the last few days. This repo contains some of the initial raw data. We need to run some basic processing on the data in preparation for analysis. You are welcome to use any technology; most people approach this with Jupyter notebooks and pandas.
Using the small CSV data file, write a simple Python program that processes the data and writes the output to another CSV file. Be sure to exclude people with nonsensical data. Processing includes:
- Anonymise any personal data for each person
- Add unique IDs for each person
- Calculate the age of each person
- Some people's ages may be incorrect or impossible; we should filter these out
- Test the program against the larger CSV file
The output should look something like:
| UID | First name | Last name | Address | Age |
|---|---|---|---|---|
| 1234 | xxxx | xxxx | xxx xxx | 31 |
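
The processing steps above could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch under stated assumptions, not a reference solution: the input column names (`first_name`, `last_name`, `address`, `date_of_birth`), the ISO date format, and the 120-year upper age limit are all hypothetical and should be adjusted to match the real CSV header and your own definition of nonsensical data. Masking with `x` characters simply mirrors the sample output; it still reveals the length of each value.

```python
import csv
import io
import re
import uuid
from datetime import date, datetime

# Assumed input column name and date format; adjust to the real CSV header.
DOB_COLUMN = "date_of_birth"
DOB_FORMAT = "%Y-%m-%d"
MAX_AGE = 120  # assumed upper bound for a plausible age
OUTPUT_FIELDS = ["UID", "First name", "Last name", "Address", "Age"]


def mask(value):
    """Replace every non-space character with 'x', as in the sample output."""
    return re.sub(r"\S", "x", value or "")


def age_on(dob, today):
    """Return the age in whole years on the given day."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


def process(source, target, today):
    """Write cleaned rows from source to target; return the number kept."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=OUTPUT_FIELDS)
    writer.writeheader()
    kept = 0
    for row in reader:
        try:
            dob = datetime.strptime(row[DOB_COLUMN], DOB_FORMAT).date()
        except (KeyError, TypeError, ValueError):
            continue  # missing or unparseable date of birth
        age = age_on(dob, today)
        if not 0 <= age <= MAX_AGE:
            continue  # born in the future or implausibly old
        writer.writerow({
            "UID": uuid.uuid4().hex,  # random, unique per row
            "First name": mask(row.get("first_name")),
            "Last name": mask(row.get("last_name")),
            "Address": mask(row.get("address")),
            "Age": age,
        })
        kept += 1
    return kept


if __name__ == "__main__":
    # Made-up rows for illustration only; not taken from the real data files.
    demo = io.StringIO(
        "first_name,last_name,address,date_of_birth\n"
        "Ada,Smith,12 High St,1990-06-15\n"
        "Bob,Jones,1 Low Rd,2990-01-01\n"
    )
    result = io.StringIO()
    process(demo, result, date.today())
    print(result.getvalue())
```

To run this against the real files, open the input and output with `newline=""` and pass the resulting file objects to `process` along with `date.today()`. Because rows are streamed one at a time, the same function should cope with the larger CSV file without loading it all into memory.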
Following on from this, discuss how you would implement this in a cloud environment or data lake, and how you would productionize the code.
Finally, calculate the median and 95th percentile age of our player base from the large processed dataset.
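
One possible way to compute these two statistics is shown below, again as a sketch using only the standard library. It assumes the processed file has an `Age` column, as in the sample output; the `read_ages` helper and any file name passed to it are hypothetical. Note that percentile definitions vary: `method="inclusive"` uses linear interpolation between data points, so other tools or methods may report a slightly different 95th percentile.

```python
import csv
import statistics


def read_ages(path, column="Age"):
    """Load the age column of a processed CSV file as integers."""
    with open(path, newline="") as handle:
        return [int(row[column]) for row in csv.DictReader(handle)]


def age_summary(ages):
    """Return (median, 95th percentile) for a list of ages."""
    median = statistics.median(ages)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    # Older Python versions need at least two data points here.
    p95 = statistics.quantiles(ages, n=100, method="inclusive")[94]
    return median, p95
```

For example, `age_summary(read_ages(path))` would return both values for the processed file at `path`.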