A modular Python pipeline for extracting, cleaning, and classifying web data using open-source tools. Designed for smart web prospecting, offline research, and scalable automation.
- 🔍 Web Scraping: Collects structured data from target websites using `requests` and `BeautifulSoup`.
- 🧹 Data Cleaning: Standardizes and prepares text for analysis using `pandas`, `re`, and `unicodedata`.
- 🧠 ML Classification: Predicts categories or relevance using a trained `scikit-learn` classifier.
- 🗃️ Export Options: Saves results to `.csv`, `.json`, and `.txt` formats for integration and sharing.
- 🔐 Offline-First: Fully operable without internet after initial setup. No external API dependencies.
- 📁 Modular Design: Easy to adapt, reuse, or extend in parts or as a full pipeline.
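The scraping step above can be sketched with `requests` and `BeautifulSoup`. Note that the `div.listing` selector and the name/description fields below are illustrative assumptions, not the project's actual markup:

```python
from typing import Dict, List

import requests
from bs4 import BeautifulSoup


def parse_listings(html: str) -> List[Dict[str, str]]:
    """Extract name/description pairs from listing markup (selectors are illustrative)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("div.listing"):  # hypothetical CSS class
        name = card.select_one("h2")
        desc = card.select_one("p")
        rows.append({
            "name": name.get_text(strip=True) if name else "",
            "description": desc.get_text(strip=True) if desc else "",
        })
    return rows


def scrape_listings(url: str) -> List[Dict[str, str]]:
    """Fetch a target page and parse it; raises on HTTP errors."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_listings(response.text)
```

Keeping the HTML parsing separate from the HTTP fetch makes the parser testable offline, which fits the offline-first goal.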
- Local business discovery and qualification
- Lead generation and client research
- Academic data collection and preprocessing
- Custom classifiers for specific domains or keywords
| Function | Tools Used |
|---|---|
| Scraping | requests, BeautifulSoup |
| Cleaning | pandas, re, unicodedata |
| ML Classification | scikit-learn, joblib |
| Exporting | csv, json, plain text handling |
| Logging & CLI | argparse, logging, .env |
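A cleaning pass with the tools in the table might look like the following sketch; the column names are assumptions for illustration:

```python
import re
import unicodedata
from typing import List

import pandas as pd


def normalize_text(value: str) -> str:
    """Strip accents, collapse whitespace, and lowercase a text field."""
    value = unicodedata.normalize("NFKD", value)
    value = value.encode("ascii", "ignore").decode("ascii")  # drop diacritics
    value = re.sub(r"\s+", " ", value).strip()
    return value.lower()


def clean_frame(df: pd.DataFrame, text_cols: List[str]) -> pd.DataFrame:
    """Normalize the given text columns and drop rows left entirely empty."""
    df = df.copy()
    for col in text_cols:
        df[col] = df[col].fillna("").astype(str).map(normalize_text)
    return df[df[text_cols].ne("").any(axis=1)].reset_index(drop=True)
```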
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the scraper (customize inside config or via CLI)
python scrape.py --target "https://example.com" --output data/raw.csv

# 3. Clean and preprocess the scraped data
python clean.py --input data/raw.csv --output data/clean.csv

# 4. Classify the cleaned data using the trained model
python classify.py --input data/clean.csv --model models/classifier.pkl --output results/predictions.csv
```

Each step is independent and can be used modularly.
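Step 4 loads a pickled scikit-learn model with `joblib` and scores each row. A minimal sketch of that flow, where the `description` column and the model's exact type are assumptions:

```python
import joblib
import pandas as pd


def classify_frame(df: pd.DataFrame, model_path: str) -> pd.DataFrame:
    """Attach a predicted category and confidence score to each row."""
    model = joblib.load(model_path)       # e.g. models/classifier.pkl
    texts = df["description"].fillna("")  # assumed text column
    out = df.copy()
    out["category"] = model.predict(texts)
    out["score"] = model.predict_proba(texts).max(axis=1)  # top-class probability
    return out
```

Any scikit-learn estimator or pipeline exposing `predict` and `predict_proba` works here, for example `make_pipeline(TfidfVectorizer(), LogisticRegression())` saved with `joblib.dump`.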
Example output (`results/predictions.csv`):

```csv
name,website,description,category,score
"ACME Bakery","acme.com","...","food",0.92
"FutureTech Solutions","futuretech.ai","...","tech",0.88
...
```

Project structure:

```
scrape_classify_pipeline/
│
├── data/          # Raw and processed data
├── models/        # Trained models (.pkl)
├── results/       # Classification outputs
├── scrape.py      # Web scraping logic
├── clean.py       # Data cleaning script
├── classify.py    # ML classification pipeline
├── utils/         # Reusable utility functions
├── config.env     # Environment variables
└── README.md      # Project overview
```
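The export step (`.csv`, `.json`, `.txt`) needs only the standard library. A sketch, with a hypothetical `export_results` helper and output names:

```python
import csv
import json
from pathlib import Path
from typing import Dict, List


def export_results(rows: List[Dict], out_dir: str, stem: str = "predictions") -> None:
    """Write the same records to CSV, JSON, and plain-text files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / f"{stem}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    (out / f"{stem}.json").write_text(json.dumps(rows, indent=2), encoding="utf-8")
    lines = [" | ".join(str(v) for v in row.values()) for row in rows]
    (out / f"{stem}.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")
```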
- Python 3.8+
- pandas
- beautifulsoup4
- scikit-learn
- requests
- joblib
This pipeline does not rely on cloud APIs or send data externally. Everything runs locally, making it ideal for private research or restricted environments.
Licensed under the MIT License – feel free to modify and use for personal or commercial projects.
Jose Daniel Soto 📧 Email | 🌐 GitHub | 🔗 LinkedIn