Merged
32 commits
dc67dd8
Edit file with letter c, my collection directory
tgoedecke Oct 10, 2018
eebf6a7
Update pgoedec1.ipynb (Python 3)
tgoedecke Oct 10, 2018
a2b32f9
Update pgoedec1.ipynb
tgoedecke Oct 10, 2018
3bd25a6
Update collname repeat
tgoedecke Oct 10, 2018
7d4c5b7
Update README.md
audrism Oct 15, 2018
0af3808
Update README.md
audrism Oct 15, 2018
0f85604
Create readNpm.py
audrism Oct 15, 2018
fea34c9
Update README.md
audrism Oct 15, 2018
ce27f26
Update README.md
audrism Oct 15, 2018
71d6f1c
Update README.md
audrism Oct 15, 2018
b82c2af
Create readGit.py
audrism Oct 15, 2018
ad9edae
Update README.md
audrism Oct 15, 2018
602a742
Merge pull request #20 from tgoedecke/master
tgoedecke Oct 17, 2018
2327f0a
update readGit.py
tgoedecke Oct 17, 2018
6083d25
Update README.md
audrism Oct 17, 2018
011f2d2
Update readNpm.py
audrism Oct 17, 2018
4b21422
Update readGit.py
audrism Oct 17, 2018
cfa97f4
fixed readGit.py
audrism Oct 17, 2018
783c96f
Update README.md
audrism Oct 17, 2018
4753959
Update README.md
audrism Oct 18, 2018
8a51fdd
Update README.md
audrism Oct 18, 2018
a156c9d
added extraction scripts
audrism Oct 18, 2018
35aa9eb
Merge branch 'master' of gh:fdac18/Miniproject2
audrism Oct 18, 2018
b3e1d3a
added script to compare releases
audrism Oct 18, 2018
d10980a
Added compareRels.py
audrism Oct 18, 2018
1da1844
Update readNpm.py
audrism Oct 19, 2018
8a44a89
Update README.md
audrism Oct 19, 2018
2c8a408
Fix case when there is no common ancestor between commits
EvanEzell Oct 23, 2018
d725d16
Update README.md
audrism Oct 29, 2018
b540b8f
Update README.md
audrism Oct 29, 2018
88fa5f1
Update README.md
audrism Oct 29, 2018
bd3eeb6
Merge pull request #29 from EvanEzell/patch-1
audrism Oct 29, 2018
126 changes: 125 additions & 1 deletion README.md
@@ -1,4 +1,128 @@
# MiniProject2: Discover a list of projects on SourceForge.net and GitLab.com
# MiniProject2: Phase2: Store info on NPM packages in MongoDB

## Task: Getting Release info from GitHub on NPM packages

### Resources:
NPM package list

The list of packages is unique to each one of you:
/data/NPMvulnerabilities/NPMpkglist/NPMpkglist_XX.gz
where XX is a number from 0 to 33; to find your number, look at the table below.

### Goal:
1. Download npm data on all your packages and store it in MongoDB
(database: fdac18mp2, collection: npm_yourutkid). Example code is in readNpm.py:
```
zcat /data/NPMvulnerabilities/NPMpkglist/NPMpkglist_XX.gz | python3 readNpm.py
```
Please keep in mind that /data/NPMvulnerabilities/ is not on gcloud, only
on da2, so please run it on da2 or copy NPMpkglist_XX.gz to gcloud.

2. Identify the packages that have GH repos (based on the stored info)
```
import pymongo, json, sys
client = pymongo.MongoClient ()
db = client ['fdac18mp2']
id = sys.argv[1] #your utkid
coll = db [ 'npm_' + id]
for r in coll.find():
    if 'collected' in r:
        r = r['collected']
        if 'metadata' in r:
            r = r['metadata']
            if 'repository' in r:
                r = r['repository']
                if 'url' in r:
                    r = r['url']
                    print (r)
```
The above code is in extrNpm.py. To output the urls:
```
python3 extrNpm.py > myurls
```

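3. For each such package, get a list of all releases. Example file is readGit.py (you can feed it the URLs produced by the snippet above). It reads repository URLs from standard input and populates the
releases_yourutkid collection. As committed, readGit.py takes your GitHub login and password as its two command-line arguments (GH_LOGIN and GH_PASSWD below are placeholders) and writes to the collection named in collName, so change collName to releases_yourutkid first. Reference to GitHub API:
```
cat myurls | python3 readGit.py GH_LOGIN GH_PASSWD
#or
python3 readGit.py GH_LOGIN GH_PASSWD < myurls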
```
4. Extract releases from mongodb
```
import pymongo, json, sys
client = pymongo.MongoClient (host="da1")
db = client ['fdac18mp2']
id = "audris"
coll = db [ 'releases_' + id]
for r in coll.find():
    n = r['name']
    if 'values' in r:
        for v in r['values']:
            if 'tag_name' in v:
                print (n+';'+v['tag_name'])
```
The above code is in extrRels.py. To output the releases (one `name;tag_name` pair per line):
```
python3 extrRels.py > myrels
```


5. Find the number of commits between the latest release and the other releases.

The GitHub compare API reports this for any two tags or branches, e.g. https://api.github.com/repos/webpack-contrib/html-loader/compare/v0.5.4...master or https://api.github.com/repos/git/git/compare/v2.2.0-rc1...v2.2.0-rc2
More resources: https://stackoverflow.com/questions/26925312/github-api-how-to-compare-2-commits (look for comparing the tags in the answer)
In the returned JSON, look for fields like the following to get the number of commits between releases:
```
"status": "ahead",
"ahead_by": 24,
"behind_by": 0,
"total_commits": 24,
```
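For example (compareRels.py reads your GitHub login and password from its two command-line arguments; GH_LOGIN and GH_PASSWD are placeholders):
```
cat myrels | python3 compareRels.py GH_LOGIN GH_PASSWD > myrels.cmp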
```
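
The fields listed in step 5 can be pulled out of a compare response with a few lines of Python. This is only a sketch: the helper name `summarize_compare` is hypothetical (it is not part of compareRels.py), and the sample payload simply repeats the values shown above rather than a real API response.

```python
import json

def summarize_compare(payload):
    """Return (status, ahead_by, behind_by, total_commits), or None if any field is missing.

    Hypothetical helper for illustration; not taken from compareRels.py.
    """
    if not isinstance(payload, dict):
        return None
    keys = ("status", "ahead_by", "behind_by", "total_commits")
    if not all(k in payload for k in keys):
        return None
    return tuple(payload[k] for k in keys)

# Sample payload built from the fields quoted in the README, not a live response.
sample = json.loads(
    '{"status": "ahead", "ahead_by": 24, "behind_by": 0, "total_commits": 24}'
)
print(summarize_compare(sample))
```

Returning `None` for an incomplete payload lets the caller tell a failed comparison (for example an error object without `ahead_by`) apart from a real result of zero commits.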

| number | GitHub Username | NetID | Name |
|:-:|:-:|:-:|---|
| 0 | 3PIV | pprovins | Provins IV, Preston |
| 1 | BrettBass13 | bbass11 | Bass, Brett Czech |
| 2 | CipherR9 | gyj992 | Johnson, Rojae Antonio |
| 3 | Colsarcol | cmawhinn | Mawhinney, Colin Joseph |
| 4 | EvanEzell | eezell3 | Ezell, Evan Collin |
| 5 | MikeynJerry | jdunca51 | Duncan, Jerry |
| 6 | Tasmia | trahman4 | Rahman, Tasmia |
| 7 | awilki13 | awilki13 | Wilkinson, Alex Webb |
| 8 | bryanpacep1 | jpace7 | Pace, Jonathan Bryan |
| 9 | caiwjohn | cjohn3 | John, Cai William |
| 10 | cflemmon | cflemmon | Flemmons, Cole |
| 11 | dbarry9 | dbarry | Barry, Daniel Patrick |
| 12 | desai07 | adesai6 | Desai, Avie |
| 13 | gjones1911 | gjones2 | Jones, Gerald Leon |
| 14 | herronej | eherron5 | Herron, Emily Joyce |
| 15 | hossain-rayhan | rhossai2 | Hossain, Rayhan |
| 16 | jdong6 | jdong6 | Dong, Jeffrey Jing |
| 17 | jyu25utk | jyu25 | Yu, Jinxiao |
| 18 | mkramer6 | mkramer6 | Kramer, Matthew S |
| 19 | mmahbub | mmahbub | Mahbub, Maria |
| 20 | nmansou4 | nmansou4 | Mansour, Nasib |
| 21 | nschwerz | nschwerz | Schwerzler, Nicolas Winfield William |
| 22 | rdabbs42 | rdabbs1 | Dabbs, Rosemary |
| 23 | saramsv | mousavi | Mousavicheshmehkaboodi, Sara |
| 24 | spaulsteinberg | ssteinb2 | Steinberg, Samuel Paul |
| 25 | zol0 | akarnauc | Karnauch, Andrey |
| 26 | zrandall | zrandall | Randall, Zachary Adams |
| 27 | lpassarella | lpassare | Passarella, Linsey Sara |
| 28 | tgoedecke | pgoedec1 | Goedecke, Trish |
| 29 | ray830305 | hchang13 | Chang, Hsun Jui |
| 30 | ssravali | ssadhu2 | Sadhu, Sri Ravali |
| 31 | diadoo | jpovlin | Povlin, John P |
| 32 | mander59 | mander59 | Anderson, Matt Mcguffee |
| 33 | iway1 | iway1 | Way, Isaac Caldwell |

# MiniProject2: Phase1: Discover a list of projects on SourceForge.net and GitLab.com


These two forges present two different types of data discovery challenges.
83 changes: 83 additions & 0 deletions compareRels.py
@@ -0,0 +1,83 @@
import sys, re, pymongo, json, time
import datetime
from requests.auth import HTTPBasicAuth
import requests
gleft = 1500

#client = pymongo.MongoClient ()
client = pymongo.MongoClient (host="da1.eecs.utk.edu")
# GitHub credentials are taken from the command line
login = sys.argv[1]
passwd = sys.argv[2]

baseurl = 'https://api.github.com/repos'
headers = {'Accept': 'application/vnd.github.v3.star+json'}
# the second assignment overrides the first
headers = {'Accept': 'application/vnd.github.hellcat-preview+json'}

db = client['fdac18mp2'] # added in class
collName = 'releases_audris'
coll = db [collName]

# block until at least 20 API requests remain in the current rate-limit window
def wait (left):
    while (left < 20):
        l = requests .get('https://api.github.com/rate_limit', auth=(login,passwd))
        if (l.ok):
            left = int (l.headers.get ('X-RateLimit-Remaining'))
            reset = int (l.headers.get ('x-ratelimit-reset'))
            now = int (time.time ())
            dif = reset - now
            if (dif > 0 and left < 20):
                sys.stderr.write ("waiting for " + str (dif) + "s; requests left: " + str(left) + "\n")
                time .sleep (dif)
        time .sleep (0.5)
    return left

# fetch url and return the decoded JSON body ({} if the request fails)
def get (url):
    global gleft
    gleft = wait (gleft)
    values = []
    # sys.stderr.write ("left:"+ str(left)+"s\n")
    try:
        r = requests .get (url, headers=headers, auth=(login, passwd))
        time .sleep (0.5)
        if (r.ok):
            gleft = int(r.headers.get ('X-RateLimit-Remaining'))
            lll = r.headers.get ('Link')
            links = ['']
            if lll is not None:
                links = lll.split(',')
    except Exception as e:
        sys.stderr.write ("Could not get:" + url + ". Exception:" + str(e) + "\n")
        # r is undefined here, so return an empty result instead of parsing it
        return {}
    return (json.loads(r.text))

def chunks(l, n):
    if n < 1: n = 1
    return [l[i:i + n] for i in range(0, len(l), n)]

# print "url;ahead_by;behind_by" for one compare URL
def cmp_rel (url):
    v = []
    size = 0
    try:
        v = get (url)
    except Exception as e:
        sys.stderr.write ("Could not get:" + url + ". Exception:" + str(e) + "\n")
    if 'ahead_by' in v and 'behind_by' in v:
        print (url+';'+str(v['ahead_by'])+';'+str(v['behind_by']))
    else:
        sys.stderr.write ("Could not compare releases for: " + url + "; There exists no common ancestor between the two versions." + "\n")


# group release tags by project; input lines look like "owner/repo;tag"
p2r = {}
for l in sys.stdin.readlines():
    l = l.rstrip()
    # split on the first ';' only, in case a tag name contains one
    p, r = l.split(';', 1)
    if p in p2r:
        p2r[p] .append (r)
    else:
        p2r[p] = [r]

# compare each pair of consecutive releases of a project
for p in p2r:
    rs = p2r[p]
    if len (rs) > 1:
        for i in range(1,len (rs)):
            url = 'https://api.github.com/repos/'+p+'/compare/' + rs[i-1] + '...' + rs[i]
            cmp_rel (url)
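
The grouping and pairing done at the end of compareRels.py can be tried without network access or credentials. The sketch below is an illustration under assumptions: the function name `compare_urls` and its `base` parameter are hypothetical, and it only builds the compare URLs for consecutive `project;tag` lines in input order, without calling the API.

```python
from collections import OrderedDict

def compare_urls(lines, base="https://api.github.com/repos"):
    """Build compare URLs for consecutive tags of each project.

    Hypothetical helper: input lines look like "owner/repo;tag", as printed
    by extrRels.py. Tags are paired in the order they appear in the input.
    """
    releases = OrderedDict()
    for line in lines:
        line = line.strip()
        if ";" not in line:
            continue  # skip blank or malformed lines
        project, tag = line.split(";", 1)
        releases.setdefault(project, []).append(tag)

    urls = []
    for project, tags in releases.items():
        # zip pairs each tag with the one that follows it
        for base_tag, head_tag in zip(tags, tags[1:]):
            urls.append("{}/{}/compare/{}...{}".format(base, project, base_tag, head_tag))
    return urls

for u in compare_urls(["git/git;v2.2.0-rc1", "git/git;v2.2.0-rc2", "solo/pkg;v1"]):
    print(u)
```

A project with a single release yields no URL, which matches the `len (rs) > 1` check in the script.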

15 changes: 15 additions & 0 deletions extrNpm.py
@@ -0,0 +1,15 @@
import pymongo, json, sys
client = pymongo.MongoClient (host="da1")
db = client ['fdac18mp2']
id = "audris"
coll = db [ 'npm_' + id]
for r in coll.find():
    if 'collected' in r:
        r = r['collected']
        if 'metadata' in r:
            r = r['metadata']
            if 'repository' in r:
                r = r['repository']
                if 'url' in r:
                    r = r['url']
                    print (r)
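
The nested lookup in extrNpm.py can also be written as a loop over the key path, which makes it easy to test on plain dictionaries. This is a sketch, not the course's code: `repo_url` is a hypothetical name, and the sample document below is made up to mirror the `collected` / `metadata` / `repository` / `url` nesting the script expects.

```python
def repo_url(doc):
    """Walk collected -> metadata -> repository -> url; return None if any level is missing.

    Hypothetical helper mirroring the nested checks in extrNpm.py.
    """
    node = doc
    for key in ("collected", "metadata", "repository", "url"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node if isinstance(node, str) else None

# Made-up document for illustration only.
example = {
    "collected": {
        "metadata": {
            "repository": {"type": "git", "url": "git+https://github.com/owner/repo.git"}
        }
    }
}
print(repo_url(example))
print(repo_url({"collected": {"metadata": {}}}))
```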
11 changes: 11 additions & 0 deletions extrRels.py
@@ -0,0 +1,11 @@
import pymongo, json, sys
client = pymongo.MongoClient (host="da1")
db = client ['fdac18mp2']
id = "audris"
coll = db [ 'releases_' + id]
for r in coll.find():
    n = r['name']
    if 'values' in r:
        for v in r['values']:
            if 'tag_name' in v:
                print (n+';'+v['tag_name'])
4 changes: 2 additions & 2 deletions jdunca51.ipynb → pgoedec1.ipynb
@@ -18,8 +18,8 @@
"from bs4 import BeautifulSoup\n",
"\n",
"dbname = \"fdac18mp2\" #please use this database\n",
"collname = \"glprj_jdunca51\" #please modify so you store data in your collection\n",
"my_char = 'f'\n",
"collname = \"tgoedecke\" #please modify so you store data in your collection\n",
"my_char = 'c'\n",
"\n",
"# beginning page index\n",
"begin = \"1\"\n",
126 changes: 126 additions & 0 deletions readGit.py
@@ -0,0 +1,126 @@
import sys, re, pymongo, json, time
import datetime
from requests.auth import HTTPBasicAuth
import requests
gleft = 1500

#client = pymongo.MongoClient ()
client = pymongo.MongoClient (host="da1.eecs.utk.edu")
# GitHub credentials are taken from the command line
login = sys.argv[1]
passwd = sys.argv[2]

baseurl = 'https://api.github.com/repos'
headers = {'Accept': 'application/vnd.github.v3.star+json'}
# the second assignment overrides the first
headers = {'Accept': 'application/vnd.github.hellcat-preview+json'}

db = client['fdac18mp2'] # added in class
collName = 'releases_audris'
coll = db [collName]

# block until at least 20 API requests remain in the current rate-limit window
def wait (left):
    while (left < 20):
        l = requests .get('https://api.github.com/rate_limit', auth=(login,passwd))
        if (l.ok):
            left = int (l.headers.get ('X-RateLimit-Remaining'))
            reset = int (l.headers.get ('x-ratelimit-reset'))
            now = int (time.time ())
            dif = reset - now
            if (dif > 0 and left < 20):
                sys.stderr.write ("waiting for " + str (dif) + "s; requests left: " + str(left) + "\n")
                time .sleep (dif)
        time .sleep (0.5)
    return left

# fetch all pages of a listing; returns (list of items, total response size in characters)
def get (url):
    global gleft
    gleft = wait (gleft)
    values = []
    size = 0
    # sys.stderr.write ("left:"+ str(left)+"s\n")
    try:
        r = requests .get (url, headers=headers, auth=(login, passwd))
        time .sleep (0.5)
        if (r.ok):
            gleft = int(r.headers.get ('X-RateLimit-Remaining'))
            lll = r.headers.get ('Link')
            links = ['']
            if lll is not None:
                links = lll.split(',')
            t = r.text
            size += len (t)
            try:
                array = json .loads (t)
                for el in array:
                    values .append (el)
            except Exception as e:
                sys.stderr.write(str(e)+" in json .loads\n")
            #t = r.text.encode ('utf-8')
            # follow the "next" link while the first Link entry points to one
            while '; rel="next"' in links[0]:
                gleft = int(r.headers.get ('X-RateLimit-Remaining'))
                gleft = wait (gleft)
                url = links[0] .split(';')[0].replace('<','').replace('>','')
                try:
                    r = requests .get(url, headers=headers, auth=(login, passwd))
                    if (r.ok):
                        lll = r.headers.get ('Link')
                        links = ['']
                        if lll is not None:
                            links = lll .split(',')
                        t = r.text
                        size += len (t)
                        try:
                            array = json.loads (t)
                            for el in array:
                                values .append (el)
                            print ('in load next: ' + str(len (values)))
                        except Exception as e:
                            sys.stderr.write(str(e)+" in json .loads next\n")
                    else:
                        links = ['']
                except requests.exceptions.ConnectionError:
                    # links is a list, so convert it before concatenating
                    sys.stderr.write('could not get ' + str(links) + ' for '+ url + '\n')
                    # stop paging; otherwise the same URL would be retried forever
                    links = ['']
                    #print u';'.join((u, repo, t)).encode('utf-8')
            try:
                print (url + ';' + str(values))
            except Exception as e:
                sys.stderr.write(str(e)+" in print " + url + "\n")
        else:
            print (url + ';ERROR r not ok')
    except requests.exceptions.ConnectionError:
        print (url + ';ERROR ConnectionError')
    print ('returning nkeys=' + str(len (values)))
    return values, size

def chunks(l, n):
    if n < 1: n = 1
    return [l[i:i + n] for i in range(0, len(l), n)]

for n in sys.stdin.readlines():
    #first clean the url
    n = n.rstrip()
    n = re.sub("^.*github.com/","",n)
    n = re.sub(r"\.git$","",n)
    url = baseurl + '/' + n + '/releases'
    url1 = url
    print("trying to get: " + url1)
    v = []
    size = 0
    try:
        v, size = get (url1)
        print (str (len (v)) + ';' + str (size) + ';' + url1)
        sys .stdout .flush ()
    except Exception as e:
        sys.stderr.write ("Could not get:" + url1 + ". Exception:" + str(e) + "\n")
        continue
    print (url1 + ' after exception lenv(v)=' + str(len (v)))
    ts = datetime.datetime.utcnow()
    if len (v) > 0:
        # size may be bigger in bson, factor of 2 does not always suffice
        if (size < 16777216/3):
            coll.insert_one ( { 'name': n, 'url': url, 'utc':ts, 'values': v } )
        else:
            # aim for pieces of roughly a third of the 16 MB BSON limit at most;
            # use separate variables so the project name in n is not overwritten
            nchunks = int (3*size/16777216) + 1
            per_chunk = max (1, len (v) // nchunks)
            i = 0
            for ch in chunks (v, per_chunk):
                coll.insert_one ( { 'chunk': i, 'name':n, 'url': url, 'utc':ts, 'values': ch } )
                i = i + 1
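
readGit.py decides whether to fetch another page by looking only at the first entry of the `Link` header. A variant that scans every entry for `rel="next"`, so that it does not depend on the position of that entry, might look like the sketch below. The function name `next_page_url` is hypothetical, the header in the example is made up, and splitting on commas assumes the URLs themselves contain none. The `requests` library also exposes parsed links as `Response.links`, which may be simpler in practice.

```python
import re

def next_page_url(link_header):
    """Return the URL tagged rel="next" in a Link header, or None if there is none.

    Hypothetical helper; assumes entries are comma-separated and look like
    '<url>; rel="name"'.
    """
    if not link_header:
        return None
    for part in link_header.split(","):
        m = re.match(r'\s*<([^>]*)>\s*;(.*)', part)
        if m and re.search(r'rel="?next"?', m.group(2)):
            return m.group(1)
    return None

# Made-up header for illustration; the "next" entry is deliberately not first.
header = ('<https://api.github.com/repositories/1/releases?page=1>; rel="prev", '
          '<https://api.github.com/repositories/1/releases?page=3>; rel="next"')
print(next_page_url(header))
```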