MRCSIBR/ZH_webscraper

ZH_webscraper

Description

This project is a web scraper that collects question/answer and article data from various sources.

Question and Column Layout

  • Question: QA/title
  • URL: urls
  • Author: authors
  • Author URL: user_links
  • Content: answers and articles
  • PID: single ID for articles / composite IDs for Q/A
  • Image URL: image_urls
  • Date: timestamps
  • Approval: upvotes
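The layout above can be sketched as a record builder and CSV writer. This is a minimal illustration, not the project's actual code: the concrete column names (`title`, `url`, `author`, …) are assumptions derived from the list above.

```python
import csv
import io

# Column names are assumptions based on the layout described above.
COLUMNS = ["title", "url", "author", "user_link", "content",
           "pid", "image_urls", "timestamp", "upvotes"]

def make_row(**fields):
    """Build one scraped record, filling missing columns with empty strings."""
    return {col: fields.get(col, "") for col in COLUMNS}

def rows_to_csv(rows):
    """Serialize records to CSV text in the layout's column order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

For articles the `pid` would hold a single ID; for Q/A pairs it would hold a composite ID (e.g. question ID plus answer ID), as noted in the layout.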

Getting started

  1. First, collect keyword pages and pass them to the main_search module. When the search is done, concatenate all CSV files and check for duplicates.

  2. Once a CSV with all URLs is created, read it to grab the question and article data with <name_search.csv>

  3. Concatenate the final question and article datasets.
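The concatenate-and-deduplicate step in the workflow above can be sketched as follows. This is a stdlib-only illustration under the assumption that all intermediate CSV files share the same header and that the URL column uniquely identifies a record; the function name and `key` parameter are hypothetical, not taken from the project.

```python
import csv

def concat_and_dedupe(paths, key="url"):
    """Concatenate several CSV files with a shared header and drop
    duplicate rows by the given key column (first occurrence wins)."""
    seen = set()
    merged = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh):
                if row[key] not in seen:
                    seen.add(row[key])
                    merged.append(row)
    return merged
```

The same step could equally be done with pandas (`pd.concat` followed by `drop_duplicates`); the stdlib version is shown only to keep the sketch dependency-free.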

ToDo

Research comment data: comment data cannot currently be fetched because the comment endpoints are hashed and change for each page.

License

Open source project
