Getting to know each other, discussing the plan, the guidelines for sharing the assignments, etc.
Going through a small demo/tutorial for Jupyter Notebooks. See [demo notebook here](1. Intro to Jupyter Notebooks - Python.ipynb).
Create simple prediction models from small datasets. See [demo notebook here](2. Linear Regression - Python.ipynb)
- House prices: Predict the price of a house based on surface, lot size, #bathrooms, #bedrooms, etc.
- Titanic survivability: Predict the likelihood of someone surviving the sinking of the Titanic based on their gender, age, passenger class and some other variables.
- Video Game sales with ratings: Predict how well a game will sell based on the critic rating, user rating, publisher and genre.
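As a rough illustration of the kind of model this session builds, here is a minimal least-squares sketch using NumPy. The five houses, their features and the prices are made up for the example, and `predict_price` is a hypothetical helper; the demo notebook may take a different approach.

```python
import numpy as np

# Hypothetical toy data: surface (square metres) and number of bedrooms
# for five houses. The prices are made up and follow an exact linear rule.
X = np.array([
    [50.0, 1.0],
    [70.0, 2.0],
    [90.0, 2.0],
    [110.0, 3.0],
    [130.0, 4.0],
])
y = 1000.0 * X[:, 0] + 5000.0 * X[:, 1] + 20000.0

# Add a column of ones so the model can learn an intercept.
X_design = np.column_stack([X, np.ones(len(X))])

# Ordinary least squares: find the weights that minimise the squared error.
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)


def predict_price(surface, bedrooms):
    """Predict a price from the fitted weights (hypothetical helper)."""
    return float(np.array([surface, bedrooms, 1.0]) @ weights)


print(predict_price(100.0, 3.0))
```

Because the toy prices are exactly linear in the features, the fitted weights recover the rule used to generate them. Real datasets are noisy, so the fit will only be approximate.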
Go over binary classification problems and some algorithms for solving them, e.g. logistic regression. See [demo notebook here](3. Binary Classification - Python.ipynb).
- Medical Appointment No Shows: predict whether a patient will show up for their scheduled appointment.
- Rotten Tomatoes movie reviews: predict whether the review is positive or not (and not the score itself).
- Amazon Fine Foods reviews (multivariate regression): predict whether the review is positive or not and whether other users find it helpful.
- Credit card fraud detection: predict whether a transaction is fraudulent or not.
- HR Analytics - When will an employee leave the company: predict whether an employee is likely to leave the company.
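To make the idea of logistic regression concrete, the sketch below trains a one-feature classifier with plain gradient descent in NumPy. The waiting times and labels are invented for illustration, `predict_no_show` is a hypothetical helper, and the learning rate and iteration count are arbitrary choices rather than tuned values.

```python
import numpy as np

# Hypothetical toy data: days between booking and appointment (feature)
# and whether the patient missed the appointment (label, 1 = no-show).
wait_days = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 12.0, 14.0, 15.0])
no_show = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Standardise the feature so gradient descent behaves well.
mean, std = wait_days.mean(), wait_days.std()
x = (wait_days - mean) / std


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))


# Gradient descent on the average log loss.
w, b = 0.0, 0.0
learning_rate = 0.5
for _ in range(1000):
    p = sigmoid(w * x + b)
    w -= learning_rate * np.mean((p - no_show) * x)
    b -= learning_rate * np.mean(p - no_show)


def predict_no_show(days):
    """Return True if the model predicts a no-show (hypothetical helper)."""
    return bool(sigmoid(w * (days - mean) / std + b) >= 0.5)


print(predict_no_show(2.0), predict_no_show(13.0))
```

The model outputs a probability, and the 0.5 threshold turns it into a yes/no answer. For imbalanced problems such as fraud detection, a different threshold is often more appropriate.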
Solve some simple clustering problems with K-nearest neighbors/K-means. See [demo notebook here](4. Clustering - Python.ipynb).
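The following is a minimal sketch of K-means only, written in NumPy. The six points are made up, `k_means` is a hypothetical helper, and the initial centroids are picked by hand to keep the example deterministic; real code would normally choose them at random or with k-means++.

```python
import numpy as np

# Hypothetical toy data: two well-separated groups of 2-D points.
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0],
])


def k_means(data, initial_centroids, iterations=10):
    """Alternate assignment and update steps a fixed number of times."""
    centroids = initial_centroids.copy()
    for _ in range(iterations):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(
            data[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its points.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = data[labels == k].mean(axis=0)
    return labels, centroids


# Assumption: seed the centroids with one point from each end of the data.
labels, centroids = k_means(points, points[[0, 3]])
print(labels)
print(centroids)
```

Each point ends up with the label of its nearest centroid, and each centroid sits at the mean of the points assigned to it.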
Create a model for product recommendations with collaborative filtering. See [demo notebook here](5. Collaborative Filtering - Python.ipynb)
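One simple flavour of collaborative filtering is user-based: estimate a missing rating from the ratings of similar users. The sketch below assumes a tiny made-up ratings matrix and uses cosine similarity. `cosine_similarity` and `predict_rating` are hypothetical helpers, and treating unrated items as 0 when computing similarity is a simplification.

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are products,
# and 0 means "not rated yet".
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 5.0, 1.0],
    [1.0, 1.0, 2.0, 5.0],
])


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(len(ratings)):
        # Skip the user themselves and anyone who has not rated the item.
        if other == user or ratings[other, item] == 0:
            continue
        similarity = cosine_similarity(ratings[user], ratings[other])
        weighted_sum += similarity * ratings[other, item]
        weight_total += similarity
    return weighted_sum / weight_total if weight_total > 0 else 0.0


# Estimate how user 0 would rate product 2, which they have not rated.
predicted = predict_rating(ratings, user=0, item=2)
print(round(predicted, 2))
```

User 0's tastes are closer to user 1's than to user 2's, so the estimate is pulled towards user 1's rating of 5 rather than user 2's rating of 2.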
There's no machine learning without something to learn. This section contains a list of places where you can find datasets useful for a ML study group / course / training.
- Kaggle Datasets: Large collection of datasets, many of them already explored and explained by other community members.
- UCI Machine Learning Repository: Loads of datasets, with the suitable technique (regression, classification, etc.) pointed out for each of them.
- Google research datasets
- AWS-hosted Public Datasets: Quite large datasets, but many interesting ones containing real-world data.
- https://datahub.io
- The default scikit-learn datasets
- mldata.org dataset repository
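For the datasets bundled with scikit-learn mentioned in the list above, loading one takes a couple of lines. This sketch assumes scikit-learn is installed and uses the Iris dataset purely as an example.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset; no download is needed.
iris = load_iris()

print(iris.data.shape)     # rows are samples, columns are features
print(iris.feature_names)  # names of the measured features
print(iris.target_names)   # names of the classes to predict
```

The other bundled datasets follow the same pattern with their own `load_*` functions in `sklearn.datasets`.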
There are many sources that cover the theory of machine learning.
- Machine Learning: Hands-On for Developers and Technical Professionals - Jason Bell, 2014: A book that touches many of the ML techniques in a developer-friendly way, with working code examples.
- Machine Learning: From Theory to Algorithms - Shai Shalev-Shwartz, Shai Ben-David, 2014: This book explores most machine learning branches in great depth, with the caveat of also being very theory-heavy. Better refresh your Greek alphabet before diving in.
Diagrams that assist you in choosing the correct model to train:
- scikit-learn - Choosing the right estimator
- Machine learning algorithm cheatsheet for Microsoft Azure ML Studio
Note: these only hint at the correct algorithm to use for a particular situation, and they are still useful regardless of the platform one uses.
- Anaconda: A simple way to install Python, Jupyter Notebooks and all required libraries for data science & machine learning for offline use. It should also work for other languages besides Python (R, Ruby, Scala, Java, JS), but this is untested. Feel free to add details here if you've tried it.
- RStudio: Very nice IDE for R
- Kaggle: Online hosting of Jupyter Notebooks. Supports Python (2?) and R.
- Azure Notebooks: Online hosting of Jupyter Notebooks. Supports Python 2&3, R and F#
- Anaconda Cloud: Packages must be developed offline, but can then be uploaded to Anaconda Cloud and shared with everyone.
To anyone interested in using any of these: Feel free to add dedicated sections.
- Gist: Preferred way of sharing code snippets.
- Jupyter Notebook viewer: Allows viewing of Jupyter notebooks from any URL, GitHub repo or gist.
Highly recommended course available for free on Coursera: Basic Statistics, by the University of Amsterdam.
Statistics cheatsheets: