New (as of October 2024) Crawler for University of Nottingham Course Catalogue

Our data source is the UoN Course Catalogue. The catalogue changed a lot this year, but it is still much less pleasant to use than Nott Course, and people have been asking me to update the data, so here is the crawler. I personally prefer Python, so I ditched the old JS crawler and rewrote everything.

Acknowledgements

A big thanks to...

my friend Lucien for helping with the reimagined, Selenium-free module crawler (see issue #1);
and ChatGPT for making things easier.

I just want the data!

You can either use Nott Course, which is a beautiful interface for the data, or, if you wish to do anything geeky, go to Releases and download the latest data as a SQLite database file.
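If you go the geeky route, a reasonable first step is to look at what the downloaded file contains. Below is a minimal sketch that lists every table and its columns. The helper name list_tables is made up for illustration, and the table names are not assumed: the code reports whatever it finds in the file.

```python
import sqlite3


def list_tables(db_path):
    """Return {table name: [column names]} for every table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()
```

Call it as list_tables("data.db"), replacing the path with wherever you saved the release file.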

Overview

This crawler has two parts: one for the (course) modules and one for the academic plans. They are written as two Python modules, module and plan.

An overview of the workflow:

When you run the module module, it first obtains a list of schools across the three campuses and stores it in a JSON file. It then obtains the list of modules for each school (like the information you see here). Next, for each module, a POST request (with session information) fetches the link to the module's page (see issue #1 for details), and a GET request fetches the module details (like the information you see here). The data is stored in a SQLite database where every column is TEXT; dictionaries and lists are serialised to strings with json.dumps.
When you run the plan module, it first obtains a list of academic plans for each campus and stores these 'plan briefs' in a JSON file. It then fetches the details of each plan (not using Selenium this time, so it is faster) and again stores the data in a SQLite database.

Check schemas for the JSON schemas of the plan and module objects stored in the SQLite database.
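To make the "every column is TEXT" convention concrete, here is a small sketch of the round trip: nested values go in through json.dumps and have to come back out through json.loads. The table name, the field names, and the save_record helper are all hypothetical and are not taken from the crawler's code.

```python
import json
import sqlite3


def save_record(conn, table, record):
    """Insert one record, serialising dict/list values to JSON strings."""
    columns = list(record)
    values = [
        json.dumps(v) if isinstance(v, (dict, list)) else v
        for v in record.values()
    ]
    placeholders = ", ".join("?" for _ in columns)
    column_sql = ", ".join(f'"{c}"' for c in columns)
    conn.execute(
        f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})', values
    )


conn = sqlite3.connect(":memory:")
# Hypothetical table and fields, for illustration only.
conn.execute("CREATE TABLE module (code TEXT, title TEXT, assessments TEXT)")
save_record(conn, "module", {
    "code": "ABCD1234",
    "title": "Example Module",
    "assessments": [{"type": "Exam", "duration": "2 hours"}],
})

# Reading back: the nested field is a plain string until it is decoded.
raw = conn.execute("SELECT assessments FROM module").fetchone()[0]
assessments = json.loads(raw)
print(type(raw).__name__, assessments[0]["type"])
```

The practical consequence for anyone consuming the database is that list and object fields must be passed through json.loads before use.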

Features

Concurrency!!!
Resumable download!!!

Get Started!

First you need a venv environment, which I assume you know how to set up. Also modify the variables in module/config.py and plan/config.py if needed.

pip install -r requirements.txt
mkdir res
python -m plan.main
python -m module.main

If anything goes wrong while crawling, you can simply restart the script, and it will resume by skipping whatever has already been fetched into the database. When it finishes, you should have a data.db file in the res directory (if you didn't change the relevant config fields), which is used by the backend server.
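The resume behaviour can be pictured roughly as follows. This is only a sketch of the general idea (skip what is already stored, fetch the rest concurrently), not the crawler's actual code: the table layout, the function names, and the fake_fetch stand-in for the real HTTP requests are all assumptions.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fetched_codes(conn):
    """Codes already stored, i.e. work that can be skipped on restart."""
    return {row[0] for row in conn.execute("SELECT code FROM module")}


def crawl(conn, all_codes, fetch, workers=8):
    """Fetch only the missing codes; return how many were fetched."""
    done = fetched_codes(conn)
    todo = [code for code in all_codes if code not in done]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Only the fetch calls run in worker threads; pool.map yields results
        # in input order, so all database writes stay in this thread.
        for code, detail in zip(todo, pool.map(fetch, todo)):
            conn.execute(
                "INSERT INTO module (code, detail) VALUES (?, ?)", (code, detail)
            )
            conn.commit()  # commit per row so an interrupted run keeps its progress
    return len(todo)


def fake_fetch(code):
    """Stand-in for the real network requests."""
    return f"details for {code}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE module (code TEXT PRIMARY KEY, detail TEXT)")
first = crawl(conn, ["A1", "B2", "C3"], fake_fetch)
second = crawl(conn, ["A1", "B2", "C3", "D4"], fake_fetch)  # simulated restart
print(first, second)
```

On the simulated restart only the one new code is fetched, which is why rerunning the script after a failure is cheap.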

Tips

If you see No link found in [Module Code], pass... whilst running python -m module.main, try rerunning the command after it finishes.
There are roughly 6000 modules to fetch in total.

To-dos

Refactor the module module
Make module more stable (stable now after removing the Selenium dependency)
Output Data Specification
Rewrite this README
Conform to flake8
A blog post on how I developed these

The output data format has changed so nott-course also changed a bit.

Important note

campus should always be a single letter in ['C', 'M', 'U'], not the full name!!!
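If you write code that consumes or produces this data, a small guard like the one below can catch the mistake early. It is a hypothetical helper, not part of the crawler; the only thing taken from this README is the set of allowed letters.

```python
VALID_CAMPUSES = ("C", "M", "U")


def check_campus(value):
    """Return value unchanged if it is a valid campus code, else raise."""
    if value not in VALID_CAMPUSES:
        raise ValueError(
            f"campus must be one of {VALID_CAMPUSES}, got {value!r}"
        )
    return value
```

For example, check_campus("U") returns "U", while passing a full campus name raises ValueError.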

Change of output fields compared to the previous crawler

You don't need to read this section now.

Change of course fields:

Add corequisites
Add classComment
courseWebLinks has disappeared (it was useless anyway)
Add duration to assessments
In requisites and corequisites, ["code", "title"] has replaced "subject", "courseTitle"
Convenor is now a string (name of the person), not an object

Change of plan fields:

courseWeightings has become a string, not a table
degreeCalculationModel has become a string, not a table
subjectBenchmark is now HTML
planAccreditation is now HTML
school has become a string too
the only object field is now modules
academicLoad has disappeared
No furtherInformation
notwithstandingRegulations has been renamed to additionalRegulations
Add year and campus
Remove "overview", "assessmentMethods" and "teachingAndLearning"; everything is now in learning outcomes

EricWay1024/nottCrawlerNew

Folders and files

Latest commit

History

Repository files navigation

New (as of October 2024) Crawler for University of Nottingham Course Catalogue

Acknowledgements

I just want the data!

Overview

Features

Get Started!

Tips

To-dos

Important note

Change of output fields compared to the previous crawler

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages