forked from cbogart/githubscraper
To install this package you need mysql.connector, which can be troublesome
to install; what worked for me was:
    pip install --egg mysql-connector
How to use it to download a github dataset:
* Download all githubarchive files, into one directory per year
* Hardcode the appropriate directory in get_githubarchive_aliases.py
* Run get_githubarchive_aliases.py
  * This creates a RepoAliases mongo collection capturing how repo names have changed over time
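As a rough sketch of the download step, the helper below lists the hourly archive files for one year, each paired with a local path inside a per-year directory. The function name archive_targets, the base URL, and the YYYY-MM-DD-H.json.gz file naming are assumptions and are not taken from this package; check them against the GitHub Archive site before relying on them.

```python
from datetime import datetime, timedelta
from pathlib import Path


def archive_targets(year, base_dir, base_url="https://data.gharchive.org"):
    """Yield (url, local_path) pairs for every hourly archive file in a year.

    base_url and the file naming scheme are assumptions, not part of this package.
    """
    hour = datetime(year, 1, 1)
    end = datetime(year + 1, 1, 1)
    while hour < end:
        # Assumed naming: 2015-01-01-15.json.gz, with the hour not zero-padded.
        name = f"{hour:%Y-%m-%d}-{hour.hour}.json.gz"
        # Files for a given year go under base_dir/<year>/.
        yield f"{base_url}/{name}", Path(base_dir) / str(year) / name
        hour += timedelta(hours=1)
```

Each pair can then be fetched with urllib.request.urlretrieve(url, path) once the parent directory exists; the per-year directory is the one to hardcode in get_githubarchive_aliases.py.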
Then for each scrape you want to do:
* List projects (owner/project) in a text file, one per line, called sample_set_(whatever).txt
* Fill in config.json with the directories for saving output, plus a pointer to the sample_set file
* python extract_issue.py config.json
  * Pipe the output into a log file, and check the log for errors
  * Rescrape any projects that failed, if the problem was temporary (e.g. network down)
* Edit get_githubarchive_events_mongo.sh to use the right names/directories
* bash get_githubarchive_events_mongo.sh
* python make_canonical_project_list.py
* python list_gh_users_and_projects.py
* python lookup_user_info.py
  * creates actor_info.csv with github users' name & location
* python username_match_mongo.py config.json
  * writes to data directory
* python get_membership.py config.json
  * reads from data directory
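
Before starting a scrape it can help to sanity-check the sample_set file. The sketch below is not part of this package: read_sample_set and PROJECT_RE are hypothetical names, and the set of characters allowed in owner/project names is an assumption. It skips blank lines, drops exact duplicates, and reports entries that do not look like owner/project.

```python
import re

# owner/project; the allowed character set is an assumption, not taken from this package.
PROJECT_RE = re.compile(r"^[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+$")


def read_sample_set(lines):
    """Return (valid, rejected) project names from an iterable of lines."""
    valid, rejected, seen = [], [], set()
    for raw in lines:
        name = raw.strip()  # also removes a trailing \r or \n
        if not name:
            continue  # skip blank lines
        if not PROJECT_RE.match(name):
            rejected.append(name)  # does not look like owner/project
        elif name not in seen:
            seen.add(name)  # keep only the first occurrence
            valid.append(name)
    return valid, rejected
```

For example, "with open('sample_set_whatever.txt') as f: valid, rejected = read_sample_set(f)" gives the usable project list plus any lines to fix before running extract_issue.py.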