This project implements a full web crawling, PDF text extraction, indexing, and clustering pipeline on the Concordia Spectrum document repository. The system:
- Crawls Spectrum webpages using Scrapy
- Downloads PDF documents
- Extracts clean human-readable text using pdfminer.six
- Tokenizes and indexes the documents into an inverted index
- Reconstructs document text from the index
- Applies TF–IDF vectorization
- Performs K-Means clustering with:
  - K = 2
  - K = 10
  - K = 20
- Outputs top-ranked TF–IDF terms per cluster for semantic interpretation
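The tokenize → index → reconstruct steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the real tokenization rules live in indexer/tokenizer.py, and the positional-postings layout shown here is an assumption.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Hypothetical scheme: lowercase and split on non-alphanumeric runs
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    # Map each term to {doc_id: [positions]}; storing positions is what
    # makes reconstruction of the document text possible later
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def reconstruct(index, doc_id):
    # Invert the postings: place every term back at its recorded position
    slots = {}
    for term, postings in index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    return " ".join(slots[i] for i in sorted(slots))

docs = {"d1": "Crawling Spectrum pages", "d2": "Indexing spectrum PDFs"}
idx = build_inverted_index(docs)
print(reconstruct(idx, "d1"))  # -> "crawling spectrum pages"
```

Reconstruction works only because positions are kept in the postings; a term-to-document index without positions could recover the vocabulary of a document but not its word order.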
Python version used:
Python 3.12.5

Install all dependencies with:

pip install -r requirements.txt

Directory structure:
Project2/
│
├── process_pipeline.py
├── cluster_documents.py
├── index.json
├── requirements.txt
├── README.md
│
├── crawler/
│   └── spider.py
│
└── indexer/
    ├── inverted_index.py
    └── tokenizer.py

STEP 1:
python process_pipeline.py

STEP 2:
python cluster_documents.py