This repository contains files associated with a workshop on web scraping created by Sam Huskey for the University of Oklahoma's Digital Humanities Community of Practice.
This workshop is about using the BeautifulSoup module for Python to extract and format specific information from a set of web pages. It demonstrates some key functions and techniques, but many more advanced operations lie beyond the scope of this brief workshop.
A version of the Jupyter Notebook for this workshop is available on Google Colab at https://bit.ly/423Ijom. If you want to run the code cells, please do so in a copy of the file.
By the end of this tutorial, we'll have covered the most common techniques in web scraping, but there is much, much more to learn. If you're at OU, the best place to start is University Libraries' Digital Scholarship and Data Services. They offer one-on-one consultations and on-request workshops.
There are many ways to scrape information from the internet.
You can simply copy information from your browser window and paste it into your favorite word processing or spreadsheet program. That's the best approach if you only need a small amount of information from one or two pages. Beyond that, you should consider automating your scraping with one or more of the following tools.
This workshop focuses on using the popular BeautifulSoup module for Python. It comes with many predefined functions for handling common tasks in web scraping, and it is particularly good at navigating the structure of web pages. A sketch of the basic workflow appears after the resources below.
- Beautiful Soup Documentation: The documentation includes a walkthrough of the different methods for connecting to web sites and extracting information from them.
- Python for Digital Humanities Tutorial on BeautifulSoup: This is just one of the many tutorials available on WJB Mattingly's YouTube channel for digital humanities computing.
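As a preview of the basic workflow, here is a minimal sketch. It assumes the requests module is installed alongside BeautifulSoup; the URL and the search for `<a>` tags are placeholders, not examples from the workshop itself:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (example.com is a placeholder URL).
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print the text and destination of every link on the page.
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))
```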
A spider can crawl sites or collections of sites and retrieve specific information according to the instructions you give it. Scrapy is a Python module for building spiders.
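For comparison, here is a minimal sketch of what a Scrapy spider looks like; the class name, start URL, and CSS selector are placeholder assumptions, not part of this workshop:

```python
import scrapy

class HeadingSpider(scrapy.Spider):
    """Collect the text of every <h2> heading on the start page."""
    name = "headings"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Yield each heading's text as a small record.
        for heading in response.css("h2::text").getall():
            yield {"heading": heading}
```

If this were saved as heading_spider.py, you could run it with `scrapy runspider heading_spider.py -o headings.json`.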
You can also retrieve information using command-line tools like wget. The Programming Historian site has tutorials to help you get started.
Many commercial services offer low-code or no-code solutions for scraping content from the internet.
To use Conda to recreate the Python environment used in this tutorial, run this in the Terminal:
```
conda env create -f environment.yml
```

To use Pip to recreate the environment, run this in the Terminal:

```
python -m venv webscraping
source webscraping/bin/activate
# On Windows: webscraping\Scripts\activate
pip install -r requirements.txt
```