Concordia Spectrum PDF Crawling and Clustering Project


1. Project Overview

This project implements a full web crawling, PDF text extraction, indexing, and clustering pipeline on the Concordia Spectrum document repository. The system:

  1. Crawls Spectrum webpages using Scrapy
  2. Downloads PDF documents
  3. Extracts clean human-readable text using pdfminer.six
  4. Tokenizes and indexes the documents into an inverted index
  5. Reconstructs document text from the index
  6. Applies TF–IDF vectorization
  7. Performs K-Means clustering with:
    • K = 2
    • K = 10
    • K = 20
  8. Outputs top-ranked TF–IDF terms per cluster for semantic interpretation

2. Software Environment

Python version used:

Python 3.12.5

Install all dependencies with:

pip install -r requirements.txt
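A plausible requirements.txt, inferred from the libraries named in the overview (Scrapy for crawling, pdfminer.six for extraction, scikit-learn for TF-IDF and K-Means); the exact contents of the real file may differ:

```text
scrapy
pdfminer.six
scikit-learn
```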

Directory structure:

Project2/
│
├── process_pipeline.py
├── cluster_documents.py
├── index.json
├── requirements.txt
├── README.md
│
├── crawler/
│   └── spider.py
│
└── indexer/
    ├── inverted_index.py
    └── tokenizer.py
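A minimal sketch of the tokenize-and-index stage (steps 4–5). The regex tokenizer and the positional-postings layout here are assumptions for illustration; the actual indexer/tokenizer.py and indexer/inverted_index.py may differ:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and keep alphanumeric runs (illustrative tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    # Map each term to {doc_id: [positions]} so the original token
    # stream can be reconstructed from the index alone.
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)

def reconstruct(index, doc_id):
    # Invert the positional postings back into a token sequence.
    slots = {}
    for term, postings in index.items():
        for pos in postings.get(doc_id, []):
            slots[pos] = term
    return " ".join(slots[p] for p in sorted(slots))

docs = {"d1": "Thesis regulations and thesis deadlines"}
idx = build_inverted_index(docs)
print(reconstruct(idx, "d1"))  # prints "thesis regulations and thesis deadlines"
```

Storing positions (not just document IDs) is what makes step 5, reconstruction, possible.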

3. How to Run

STEP 1 — Crawl Spectrum, download the PDFs, extract text, and build the inverted index:

python process_pipeline.py

STEP 2 — Vectorize the reconstructed documents, run K-Means, and report the top terms per cluster:

python cluster_documents.py
