We all produce a lot of data. All the time.
We need to process all that data to make it useful and to extract high-quality information from text, which can then be used for predictions and natural language processing.
The main objective here is to give a short overview of some tools that data scientists have been using for data mining.
It's important to always focus on the business and pick the tools that fit it best.
In this project I used Python 3.6.8.
We are using content extracted from this book about Machine Learning, written by Alex Smola (great stuff, btw).
The techniques that we are going to use are:
1-Case alignment
2-Tokenization
3-Stopwords removal
4-Stemming
5-Lemmatization
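
To give a rough feel for what each of the five steps does, here is a minimal sketch that uses only the Python standard library. It is not the code from the notebook, which may well rely on a dedicated NLP library. The stopword list, the suffix rules, the lemma dictionary, the function names and the sample sentence are all toy assumptions made up for illustration. In particular, the stemmer is a naive suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "were", "of", "and", "to", "in", "on"}

# Toy lemma dictionary (hypothetical); real lemmatizers use a full vocabulary.
LEMMAS = {"mice": "mouse", "running": "run", "gardens": "garden"}


def align_case(text):
    """1. Case alignment: put everything in lowercase."""
    return text.lower()


def tokenize(text):
    """2. Tokenization: split the text into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text)


def remove_stopwords(tokens):
    """3. Stopwords removal: drop very common words that carry little meaning."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(token):
    """4. Stemming: chop a known suffix, keeping at least 3 characters (naive)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def lemmatize(token):
    """5. Lemmatization: map a word to its dictionary form when it is known."""
    return LEMMAS.get(token, token)


text = "The mice were running in the gardens"
tokens = remove_stopwords(tokenize(align_case(text)))
stems = [stem(t) for t in tokens]
lemmas = [lemmatize(t) for t in tokens]

print(tokens)  # ['mice', 'running', 'gardens']
print(stems)   # ['mice', 'runn', 'garden']
print(lemmas)  # ['mouse', 'run', 'garden']
```

Note the difference between the last two steps: the toy stemmer just cuts suffixes, so it can produce non-words like "runn", while the lemmatizer returns real dictionary forms like "run" and "mouse".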
You can find more information in the notebook (data-preprocessing.ipynb) and in the presentation that guides the content (DataPreProcessing.pdf).
Enjoy!