
Web Scraping Workshop

This repository contains files associated with a workshop on web scraping created by Sam Huskey for the University of Oklahoma's Digital Humanities Community of Practice.

This workshop is about using the BeautifulSoup module for Python to extract and format specific information from a set of web pages. It demonstrates some key functions and techniques, but many more advanced operations lie beyond the scope of this brief workshop.

A version of the Jupyter Notebook for this workshop is available on Google Colab at https://bit.ly/423Ijom. If you want to run the code cells, please do so in a copy of the file.

By the end of this tutorial, we'll have covered the most common techniques in web scraping, but there is much, much more to learn. If you're at OU, the best place to start is University Libraries' Digital Scholarship and Data Services. They offer one-on-one consultations and on-request workshops.

Other Resources on Web Scraping

There are many ways to scrape information from the internet.

You can simply copy information from your browser window and paste it into your favorite word processing or spreadsheet program. That's the best approach if you only need a small amount of information from one or two pages. Beyond that, you should consider automating your scraping with one or more of the following tools.

BeautifulSoup

This workshop focuses on using the popular BeautifulSoup module for Python. It comes with many predefined functions for handling common tasks in web scraping. It's particularly good at navigating the structure of web pages.
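For example, here is a minimal sketch of that kind of navigation. The URL is a placeholder, and it assumes the requests library (a common companion to BeautifulSoup) is installed:

import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute the page you actually want to scrape.
response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# find_all() returns every matching tag; get_text() strips the markup.
for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))

# Tag attributes are available through get(), e.g. every link target:
for link in soup.find_all("a"):
    print(link.get("href"))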

Scrapy

Scrapy is a Python framework for building spiders: programs that crawl a site or collection of sites and retrieve specific information according to the instructions you give them.
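As a rough sketch (not part of this workshop's materials), a spider for Scrapy's own demo site, quotes.toscrape.com, might look like this:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors pull out each quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "Next" link, if present, to crawl additional pages.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, it could be run with scrapy runspider quotes_spider.py -o quotes.json.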

Command Line Tools

You can also retrieve information using command line tools like wget. The Programming Historian site has tutorials to help, such as Ian Milligan's "Automated Downloading with Wget."
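For instance, a single command like the following mirrors one level of a site while pausing politely between requests (the URL is a placeholder):

wget --recursive --level=1 --wait=2 --no-parent https://example.com/docs/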

Proprietary Applications

Many commercial services offer low-code or no-code solutions for scraping content from the internet.


Setting Up the Environment

Using Conda

To use Conda to recreate the Python environment used in this tutorial, run this in the Terminal:

conda env create -f environment.yml
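Once the environment is created, activate it by name. The name is set in the name: field of environment.yml; "webscraping" below is an assumption:

conda activate webscraping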

Using Pip

To use Pip to recreate the Python environment used in this tutorial, run this in the Terminal:

python -m venv webscraping
source webscraping/bin/activate
# On Windows: webscraping\Scripts\activate
pip install -r requirements.txt
