This project is a web scraper that collects data from various sources. Each record has the following fields:
- Question: QA/title
- URL: urls
- Author: authors
- Author URL: user_links
- Content: answers and articles
- PID: a single ID for articles, composite IDs for Q/A
- Image URL: image_urls
- Date: timestamps
- Approval: upvotes
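
The field list above could be modelled as a small record type. This is only a sketch: the class name `Record`, the attribute names, and the `make_pid` helper (including its `_` separator for composite IDs) are assumptions made for illustration, not names taken from the project.

```python
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class Record:
    """One scraped row; attribute names are illustrative, not the project's own."""
    question: str                    # Question: Q/A title
    url: str                         # URL of the page
    author: str                      # Author name
    author_url: str                  # Author URL (user link)
    content: str                     # Content: answer text or article body
    pid: str                         # single ID for articles, composite ID for Q/A
    image_url: Optional[str] = None  # Image URL, if the page has one
    date: Optional[str] = None       # Date: timestamp kept as a raw string
    approval: int = 0                # Approval: upvotes


def make_pid(primary_id: str, secondary_id: Optional[str] = None) -> str:
    """Hypothetical PID scheme: join two IDs with '_' for Q/A rows."""
    return primary_id if secondary_id is None else f"{primary_id}_{secondary_id}"


# Column order for a CSV export, derived from the dataclass definition.
CSV_HEADER = [f.name for f in fields(Record)]
```

A row can then be built with `Record(..., pid=make_pid("12", "3"))` and turned into a dict for CSV writing with `asdict(record)`.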

First, collect the keyword pages and pass them to the main_search module. When the search is done, concatenate all the CSV files and check for duplicates.
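
The concatenate-and-deduplicate step might look like the sketch below. It uses only the standard library and assumes that every search CSV shares the same header and that a column named `url` identifies a row. The function name `concat_csvs` and the `url` key are assumptions, not part of the project.

```python
import csv
from pathlib import Path
from typing import Iterable


def concat_csvs(paths: Iterable[Path], out_path: Path, key: str = "url") -> int:
    """Merge CSV files that share a header, keeping the first row per `key`.

    Returns the number of duplicate rows that were dropped.
    """
    seen = set()
    dropped = 0
    fieldnames = None
    rows = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                fieldnames = reader.fieldnames  # header of the first non-empty file
            for row in reader:
                if row[key] in seen:
                    dropped += 1  # same key already kept from an earlier row
                    continue
                seen.add(row[key])
                rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames or [key])
        writer.writeheader()
        writer.writerows(rows)
    return dropped
```

Returning the number of dropped rows makes the duplicate check visible: a non-zero result means the keyword searches overlapped.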

Once a CSV with all the URLs has been created, we can read it to grab the question and article data with <name_search.csv>.
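
Reading the URL CSV and handing each URL to a scraper could be sketched as follows. The helper names `read_urls` and `scrape_all`, the `url` column name, and the caller-supplied `scrape` function are all hypothetical. The project's own question and article scrapers would take the place of `scrape`.

```python
import csv
from typing import Callable, Dict, Iterator, List


def read_urls(csv_path: str, column: str = "url") -> Iterator[str]:
    """Yield the non-empty values of `column` from the URL CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get(column) or "").strip()
            if url:
                yield url


def scrape_all(
    csv_path: str, scrape: Callable[[str], Dict[str, str]]
) -> List[Dict[str, str]]:
    """Apply a caller-supplied `scrape(url)` to every URL; skip ones that raise."""
    results = []
    for url in read_urls(csv_path):
        try:
            results.append(scrape(url))
        except Exception as exc:  # a failed page should not stop the whole run
            print(f"skipped {url}: {exc}")
    return results
```

Passing the scraper in as a function keeps the CSV handling independent of how question and article pages are actually parsed.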

Concatenate the final question and article datasets.
Research comment data:
Comment data cannot be fetched because the endpoints are hashed and change for each page.
This is an open source project.