This project implements a full web crawling, PDF text extraction, indexing, and clustering pipeline on the Concordia Spectrum document repository. The system:
- Crawls Spectrum webpages using Scrapy
- Downloads PDF documents
- Extracts clean human-readable text using pdfminer.six
- Tokenizes and indexes the documents into an inverted index
- Reconstructs document text from the index
- Applies TF–IDF vectorization
- Performs K-Means clustering with:
  - K = 2
  - K = 10
  - K = 20
- Outputs top-ranked TF–IDF terms per cluster for semantic interpretation
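The tokenize → index → reconstruct steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the real tokenization rules live in indexer/tokenizer.py, and the positional-postings layout shown here is an assumption.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Hypothetical scheme: lowercase and split on non-alphanumeric runs
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def build_inverted_index(docs):
    # Map each term to {doc_id: [positions]}; storing positions is what
    # makes reconstruction of the document text possible later
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def reconstruct(index, doc_id):
    # Invert the postings: place every term back at its recorded position
    slots = {}
    for term, postings in index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    return " ".join(slots[i] for i in sorted(slots))

docs = {"d1": "Crawling Spectrum pages", "d2": "Indexing spectrum PDFs"}
idx = build_inverted_index(docs)
print(reconstruct(idx, "d1"))  # -> "crawling spectrum pages"
```

Reconstruction works only because positions are kept in the postings; a term-to-document index without positions could recover the vocabulary of a document but not its word order.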
Python version used:
Python 3.12.5

Install all dependencies with:

pip install -r requirements.txt

Directory structure:
Project2/
│
├── process_pipeline.py
├── cluster_documents.py
├── index.json
├── requirements.txt
├── README.md
│
├── crawler/
│   └── spider.py
│
└── indexer/
    ├── inverted_index.py
    └── tokenizer.py

STEP 1:
python process_pipeline.py

STEP 2:
python cluster_documents.py