Junky is a work in progress, started at PyGotham 2014, that is intended to one day become a tool for quickly identifying potential problems in the distribution and normalization of datasets.
- pandas, NumPy, etc.
Dataset Size
- How many records are there? How much room is there to examine subgroups?
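
As a rough illustration of these size checks, the sketch below counts records, estimates the in-memory footprint, and finds the smallest subgroup. The function name `dataset_size`, the sample columns, and the choice of grouping column are all hypothetical; Junky itself may take a different approach.

```python
import pandas as pd


def dataset_size(df: pd.DataFrame, group_col: str) -> dict:
    """Report row count, memory footprint, and the smallest subgroup size.

    `dataset_size` and `group_col` are illustrative names, not part of Junky.
    """
    # Number of rows in each subgroup defined by group_col
    group_sizes = df.groupby(group_col).size()
    return {
        "n_records": len(df),
        # deep=True also counts the memory held by string objects
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "smallest_group": int(group_sizes.min()),
    }


# Made-up sample data for demonstration only
df = pd.DataFrame(
    {
        "city": ["NYC", "NYC", "LA", "SF", "SF", "SF"],
        "sales": [10, 12, 7, 3, 4, 5],
    }
)
size_report = dataset_size(df, "city")
print(size_report)
```

A very small `smallest_group` value suggests that some subgroups may be too thin to analyze on their own.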
Normalization of Variables
- In columns with string types, does the data need to be cleaned to normalize categories?
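
One possible way to answer this question is to compare the number of distinct categories before and after a simple cleanup. The sketch below assumes that stripping whitespace and lowercasing is an adequate normalization, which will not hold for every dataset; `category_report` is a hypothetical helper name.

```python
import pandas as pd


def category_report(series: pd.Series) -> dict:
    """Compare distinct category counts before and after basic cleanup.

    Assumption: trimming whitespace and lowercasing is enough to merge
    variants of the same category. Real data may need more rules.
    """
    raw = series.dropna().astype(str)
    cleaned = raw.str.strip().str.lower()
    raw_count = int(raw.nunique())
    cleaned_count = int(cleaned.nunique())
    return {
        "raw_categories": raw_count,
        "cleaned_categories": cleaned_count,
        # Fewer categories after cleanup means variants were merged
        "needs_cleaning": cleaned_count < raw_count,
    }


# Made-up sample column with inconsistent spellings
colors = pd.Series(["Red", "red ", "RED", "blue", "Blue", None])
report = category_report(colors)
print(report)
```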
Likelihood of Missing Data
- Percentage of Null values for each column
- Rows with missing values in one or more columns
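
The two checks above could be sketched with pandas roughly as follows. The helper name `missing_report` and the sample data are illustrative assumptions, not part of Junky.

```python
import numpy as np
import pandas as pd


def missing_report(df: pd.DataFrame):
    """Return the percentage of nulls per column and the rows with any null."""
    # isna() gives a boolean frame; the column mean is the fraction of nulls
    pct_null = df.isna().mean() * 100
    # Keep only rows where at least one column is null
    rows_missing = df[df.isna().any(axis=1)]
    return pct_null, rows_missing


# Made-up sample data for demonstration only
df = pd.DataFrame(
    {
        "age": [34, np.nan, 29, 41],
        "city": ["NYC", "LA", None, "SF"],
    }
)
pct_null, rows_missing = missing_report(df)
print(pct_null)
print(rows_missing)
```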
Normal Distributions
- Z-scores, t-tests, F-tests, heteroskedasticity, and the Box-Jenkins test
- Max, Min, Median, Mode, Mean
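
As a starting point for the simpler items in this list, the sketch below computes the summary statistics and z-scores for a single numeric column. The function names are hypothetical, and the z-score here uses the population standard deviation (`ddof=0`), which is one choice among several; the statistical tests listed above are not covered.

```python
import pandas as pd


def describe_column(series: pd.Series) -> dict:
    """Max, min, median, mode, and mean of a numeric column (nulls dropped)."""
    values = series.dropna()
    return {
        "max": values.max(),
        "min": values.min(),
        "median": values.median(),
        # mode() can return several values when there is a tie
        "mode": values.mode().tolist(),
        "mean": values.mean(),
    }


def z_scores(series: pd.Series) -> pd.Series:
    """Standardize a column: (value - mean) / population standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)


# Made-up sample data for demonstration only
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
stats = describe_column(values)
z = z_scores(values)
print(stats)
print(z)
```

Values with a large absolute z-score are candidates for a closer look as possible outliers.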