ZH_web scraper

Description

This project is a web scraper that collects data from various sources.

Question and Column Layout

Question: QA/title
URL: URLs
Author: authors
Author URL: user_links
Content: answers and articles
PID: single ID for articles / composite IDs for Q/A
Image URL: image_urls
Date: timestamps
Approval: upvotes
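
The layout above could be written to CSV roughly as follows. This is only a sketch: the header names in COLUMNS, the make_pid helper, and the "_" separator for composite IDs are illustrative assumptions, not taken from the project's scripts.

```python
import csv

# Hypothetical header names modeled on the layout above;
# the real CSV headers used by the scraper may differ.
COLUMNS = [
    "Question", "URL", "Author", "Author URL", "Content",
    "PID", "Image URL", "Date", "Approval",
]


def make_pid(primary_id, answer_id=None):
    """Return a single ID for an article, or a composite ID for a Q/A pair.

    Joining the two parts with "_" is an assumption for illustration.
    """
    if answer_id is None:
        return str(primary_id)
    return f"{primary_id}_{answer_id}"


def write_rows(rows, handle):
    """Write dict rows to an open text handle using the fixed column order."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        # Missing fields are written as empty strings; unknown keys are dropped.
        writer.writerow({col: row.get(col, "") for col in COLUMNS})
```

Keeping the column order in one list means the question and article datasets share the same header, which makes concatenating them later straightforward.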

Getting started

First, collect the keyword pages and pass them to the main_search module. When the search is done, concatenate all the CSV files and check for duplicates.
Once a CSV with all the URLs has been created, read it to grab the question and article data with <name_search.csv>.
Finally, concatenate the question and article datasets.
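
The concatenate-and-deduplicate step can be sketched with the standard library. This is a minimal example, not the code in duplicate_check.py: the function name concat_and_dedupe and the "URL" key column are assumptions, and all input files are assumed to share the same header.

```python
import csv


def concat_and_dedupe(csv_paths, out_path, key="URL"):
    """Merge CSV files with identical headers into one file.

    Keeps the first row seen for each value of `key` and returns
    the number of rows written. The "URL" default is an assumption.
    """
    seen = set()
    fieldnames = None
    rows = []
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            if fieldnames is None:
                # Take the header from the first file.
                fieldnames = reader.fieldnames
            for row in reader:
                if row[key] in seen:
                    continue  # duplicate, skip it
                seen.add(row[key])
                rows.append(row)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames or [key])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

The same function could be reused for the final step by passing the question and article CSV files and choosing a key that is unique across both, such as the PID column.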

ToDo

Research comment data: comment data cannot currently be fetched because the endpoints are hashed and change for each page.

License

Open source project

