- A curated and opinionated list of learning resources for data scientists.
- Throughout my professional career, I have kept asking what I can learn to improve myself. There are plenty of resources on the internet, but deciding which ones give the highest return for my limited time is itself a time-consuming process.
- This document lists the resources and classes that I think can help me improve as a data scientist; they may or may not work for you.
- When I use the term data scientist, I actually mean the union of data analysts, data scientists, and data engineers. See What is a data scientist for my viewpoint.
- The focus is more on technical skills, and less on soft (people) or management skills.
While I created this document for my own learning, it is also suited for
- anyone who is currently a data scientist and wants to improve himself/herself
- anyone who is not a data scientist but wants to step into the data science world
I expect the reader to have basic skills in statistics, machine learning, and programming. If you have a BS or MS degree in STEM, then you should have the prerequisite skills.
There is no universally agreed definition of a data scientist, but here is my viewpoint.
A data scientist is someone who creates value out of data.
There are several ways to create value out of data. Based on how the value is created, people have come up with different job titles.
- Data analysts: analyze existing data to answer business problems
- Data scientists: create algorithms/models to solve business problems
- Data engineers: build algorithms/models in production
While the separation makes sense from a hiring perspective, I don't think it makes much sense from a learning perspective. Therefore, in this document I include all of them under the broad data scientist umbrella.
Instead of simply listing out the resources/courses that I find useful, I take a different approach.
First, I will list the key responsibilities of data scientists. For each responsibility, I will include a brief description and the skill sets needed for it.
For some skill sets, I will include the resources that I find useful for improving them.
- Build data pipeline
- Description
- In most cases, raw data are not suitable for data analysis. They need to be read, cleaned, and processed before they can be used in data analysis.
- This is also referred to as extract, transform, load (ETL).
- Traditionally, this is what data engineers do.
- Skill sets
- Science: basic statistics
- Science: basic machine learning
- Programming best practices
- Programming: python
- Programming: Golang
- Programming: python machine learning packages (numpy, scipy, pandas, scikit-learn)
- Technology: Docker
- Technology: Spark
- Technology: AWS
- Technology: Azure
- Technology: GCP
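The read, clean, and process steps described for the data pipeline can be sketched with the Python standard library alone. This is a minimal illustration, not a production design: the sample data, the column names (`user_id`, `signup_date`, `revenue`), the `signups` table, and the cleaning rules are all assumptions made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical raw data: a missing date, a duplicate row, and bad revenue values.
RAW_CSV = """user_id,signup_date,revenue
1,2021-01-05,10.5
2,2021-01-06,
3,,7.0
2,2021-01-06,
4,2021-01-09,abc
"""


def extract(text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop duplicates and rows without a date, parse revenue."""
    seen = set()
    cleaned = []
    for row in rows:
        if row["user_id"] in seen or not row["signup_date"]:
            continue
        seen.add(row["user_id"])
        try:
            revenue = float(row["revenue"])
        except (TypeError, ValueError):
            # Assumption for this sketch: unusable revenue counts as zero.
            revenue = 0.0
        cleaned.append((int(row["user_id"]), row["signup_date"], revenue))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS signups "
        "(user_id INTEGER PRIMARY KEY, signup_date TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO signups VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
result = conn.execute("SELECT * FROM signups ORDER BY user_id").fetchall()
print(result)
```

Keeping the three stages as separate functions makes each one testable on its own, which is where the "programming best practices" skill set starts to matter.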
- Analyze data to extract business value
- Description
- Business value may come in various forms, but the most common form is finding the answer to a particular business problem. For example, explaining why the system behaves a certain way, or what can be done to increase revenue and reduce cost.
- Traditionally, this is what data analysts do.
- Skill sets
- Good presentation skills to explain technical results to business & product people
- Good understanding of the business (what is valuable, what the operating costs are, etc.)
- Science: basic statistics
- Science: basic machine learning
- Programming: python
- Technology: data visualization
- Develop models for business problems
- Description
- Common situations
- The models may be machine learning models, signal processing algorithms, or combinations of both.
- The data scientist may need to design experiments to collect the data, if there is no data available.
- The data scientist may need to define the performance evaluation metric. It must be computable, so that it can be used in algorithm development, and understandable, so that the stakeholders can interpret the outcome of the algorithm.
- After the models are built, it is the data scientist's job to present the results to stakeholders. This means data scientists are expected to explain the algorithms/models to non-technical people (stakeholders). Data scientists are also expected to make product suggestions.
- Traditionally, this is what data scientists do.
- Skill sets
- Good presentation skills to explain technical results to business & product people
- Good understanding of the business (what is valuable, what the operating costs are, etc.)
- Science: basic statistics
- Science: basic machine learning
- Science: advanced machine learning
- Programming: python
- Programming: python machine learning packages (numpy, scipy, pandas, scikit-learn)
- Technology: TensorFlow
- Technology: PyTorch
- Technology: Docker
- Technology: AWS
- Technology: Azure
- Technology: GCP
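The point about an evaluation metric being both computable and understandable can be illustrated with precision and recall for a binary classifier. The sketch below is only one possible choice of metric; the labels, the predictions, and the wording of the plain-language summary are assumptions invented for this example.

```python
def precision_recall(y_true, y_pred):
    """Computable side: precision and recall from 0/1 labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def explain(precision, recall):
    """Understandable side: the same numbers phrased for stakeholders."""
    return (
        f"Of the cases the model flagged, {precision:.0%} were real. "
        f"Of all real cases, the model caught {recall:.0%}."
    )


# Hypothetical ground truth and model output.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(explain(precision, recall))
```

The numeric pair can be tracked during development, while the sentence form is what goes on a slide for business and product people.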
- Create a prototype
- Description
- There are two purposes for creating a prototype of the developed model.
- It can be used either internally or by alpha customers to realize the business value before the new model is put into production.
- It will be the reference point when the new model is being implemented in production.
- There are various ways of realizing the prototype. The common ways are:
- Creating an HTTP endpoint
- Creating an importable code module
- Creating a website where users can interact with the new model
- Creating a stand-alone tool (say, a Docker image) for users to run the model locally
- Traditionally, this is what data scientists do.
- Skill sets
- Programming best practices
- Programming: python
- Programming: python machine learning packages (numpy, scipy, pandas, scikit-learn)
- Technology: TensorFlow
- Technology: PyTorch
- Technology: Docker
- Technology: Protobuf
- Technology: AWS
- Technology: Azure
- Technology: GCP
- Technology: REST API
- Technology: gRPC
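Of the prototype options listed above, the HTTP endpoint is probably the quickest to sketch. The example below uses only the Python standard library; the `/predict` path, the JSON shape, the port, and the `predict` function (a stand-in that just sums its inputs) are all hypothetical. A real prototype would more likely load a trained model and use a web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for the developed model: sums the features."""
    return sum(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = predict(payload["features"])
        except (ValueError, KeyError, TypeError):
            self.send_error(400, "Expected a JSON body with a 'features' list")
            return
        body = json.dumps({"prediction": result}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the console quiet for this sketch


def make_server(port=8000):
    """Build (but do not start) a local server exposing the model."""
    return HTTPServer(("127.0.0.1", port), PredictHandler)
```

To try it, call `make_server().serve_forever()` and send a request such as `curl -X POST -d '{"features": [1, 2, 3]}' http://127.0.0.1:8000/predict`. Keeping `predict` separate from the handler means the same function can later be reused as the importable code module mentioned in the list.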
- Implement model in production
- Description
- Traditionally, this is what data engineers or software engineers do.
- However, since the data scientist is the one who creates the algorithm, it is inevitable for the data scientist to be heavily involved in the productionization process. In some small to mid-size companies, it is also not uncommon to ask the data scientist to write production code, or at least part of it.
- Skill sets
- Programming best practices
- Programming: System design
- Programming: python
- Technology: Docker
- Technology: Protobuf
- Technology: Terraform
- Technology: AWS
- Technology: Azure
- Technology: GCP
- Technology: REST API
- Technology: gRPC
- Technology: Databases
- Technology: Kubernetes
- Technology: Jenkins
- Clean Architecture: A Craftsman's Guide to Software Structure and Design
- The Complete Data Structures and Algorithms Course in Python
Useful resources that are not (yet) classified under any of the responsibilities.