This repository contains files associated with a workshop on web scraping created by Sam Huskey for the University of Oklahoma's Digital Humanities Community of Practice.
This workshop is about using the BeautifulSoup module for Python to extract and format specific information from a set of web pages. It demonstrates some key functions and techniques, but many more advanced operations lie beyond the scope of this brief workshop.
A version of the Jupyter Notebook for this workshop is available on Google Colab at https://bit.ly/423Ijom. If you want to run the code cells, please do so in a copy of the file.
By the end of this tutorial, we'll have covered the most common techniques in web scraping, but there is much, much more to learn. If you're at OU, the best place to start is University Libraries' Digital Scholarship and Data Services. They offer one-on-one consultations and on-request workshops.
There are many ways to scrape information from the internet.
You can simply copy information from your browser window and paste it into your favorite word processing or spreadsheet program. That's the best approach if you only need a small amount of information from one or two pages. Beyond that, you should consider automating your scraping with one or more of the following tools.
This workshop focuses on using the popular BeautifulSoup module for Python. It comes with many predefined functions for handling common tasks in web scraping, and it is particularly good at navigating the structure of web pages. A sketch of the basic workflow appears after the resources below.
- Beautiful Soup Documentation: The documentation includes a walkthrough of the different methods for connecting to web sites and extracting information from them.
- Python for Digital Humanities Tutorial on BeautifulSoup: This is just one of the many tutorials available on WJB Mattingly's YouTube channel for digital humanities computing.
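As a preview of the basic workflow, here is a minimal sketch. It assumes the requests module is installed alongside BeautifulSoup; the URL and the search for `<a>` tags are placeholders, not examples from the workshop itself:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page and parse its HTML (example.com is a placeholder URL).
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Print the text and destination of every link on the page.
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))
```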
A spider can crawl sites or collections of sites and retrieve specific information according to the instructions you give it. Scrapy is a Python module for building spiders.
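For comparison, here is a minimal sketch of what a Scrapy spider looks like; the class name, start URL, and CSS selector are placeholder assumptions, not part of this workshop:

```python
import scrapy

class HeadingSpider(scrapy.Spider):
    """Collect the text of every <h2> heading on the start page."""
    name = "headings"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Yield each heading's text as a small record.
        for heading in response.css("h2::text").getall():
            yield {"heading": heading}
```

If this were saved as heading_spider.py, you could run it with `scrapy runspider heading_spider.py -o headings.json`.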
You can also retrieve information using command-line tools like wget. The Programming Historian site has tutorials to help you get started.
Many commercial services offer low-code or no-code solutions for scraping content from the internet.
To use Conda to recreate the Python environment used in this tutorial, run this in the Terminal:
```
conda env create -f environment.yml
```

To use Pip to recreate the environment, run this in the Terminal:

```
python -m venv webscraping
source webscraping/bin/activate
# On Windows: webscraping\Scripts\activate
pip install -r requirements.txt
```