A demo exercise is being built with my personal data. So far:
import datadotworld
#dataset = datadotworld.load_dataset('https://data.world/ectest123/survey-2016') # notice the name in the URL:
# I renamed the project to "Amphibians", but the URL was not updated!
dataset = datadotworld.load_dataset('https://data.world/ectest123/testdatasets') # the API seems to load one file per project but several per dataset; the URL does not distinguish between the two, so the owner must know which is which
dataset.describe() # to get a description of the "dataset", which is actually the project
# output was:
# {'title': 'TestDatasets',
#  'resources': [
#    {'path': 'data/bouwprojecten.csv', 'name': 'bouwprojecten', 'format': 'csv'},
#    {'format': 'pkl', 'path': 'original/allamphibians.pkl', 'name': 'original/allamphibians.pkl', 'mediatype': 'application/octet-stream', 'bytes': 269171},
#    {'format': 'csv', 'path': 'original/bouwprojecten.csv', 'name': 'original/bouwprojecten.csv', 'mediatype': 'text/csv', 'bytes': 143452},
#    {'format': 'zip', 'path': 'original/bouwprojecten.zip', 'name': 'original/bouwprojecten.zip', 'mediatype': 'application/zip', 'bytes': 18608}],
#  'name': 'ectest123_testdatasets',
#  'homepage': 'https://data.world/ectest123/testdatasets'}
for f in [dataset.dataframes, dataset.tables, dataset.raw_data]: # only raw_data lists the pickled file, because it is binary
print(f)
#output was:
#{'bouwprojecten': LazyLoadedValue(<pandas.DataFrame>)}
#{'bouwprojecten': LazyLoadedValue(<list of rows>)}
# {'original/bouwprojecten.zip': LazyLoadedValue(<bytes>),
#  'bouwprojecten': LazyLoadedValue(<bytes>),
#  'original/allamphibians.pkl': LazyLoadedValue(<bytes>),
#  'original/bouwprojecten.csv': LazyLoadedValue(<bytes>)}
### working on the pickle file
import pickle
unpickled = pickle.loads(dataset.raw_data['original/allamphibians.pkl']) # use `loads` (takes bytes), not `load` (takes a file object)
# unpickled is my file!
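As a side note, the `loads` vs `load` distinction can be shown with a self-contained round trip (the dictionary below is a made-up stand-in for the dataset's bytes):

```python
import io
import pickle

obj = {"species": ["frog", "toad"]}  # made-up stand-in for the real data
raw = pickle.dumps(obj)              # bytes, like dataset.raw_data[...] returns

# pickle.loads takes a bytes object directly...
print(pickle.loads(raw) == obj)             # True

# ...while pickle.load expects a file-like object; wrap the bytes in BytesIO first.
print(pickle.load(io.BytesIO(raw)) == obj)  # True
```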
### working on the zipfile
# check the following references:
# --- https://stackoverflow.com/questions/9887174/how-to-convert-bytearray-into-a-zip-file
# --- https://docs.python.org/3/library/io.html
# --- http://code.activestate.com/recipes/52265-read-data-from-zip-files/
import zipfile
import io
f = io.BytesIO(dataset.raw_data['original/bouwprojecten.zip'])
uzf = zipfile.ZipFile(f, "r")
uzf.namelist()
# output => ['bouwprojecten.csv']
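To go from the zip member to actual rows, the same pattern works end to end. Here is a self-contained sketch, with an in-memory zip standing in for the bytes that `dataset.raw_data['original/bouwprojecten.zip']` returns (the CSV content is invented for the demo):

```python
import io
import zipfile

# Build a small zip in memory as a stand-in for the raw bytes from data.world
# (the column names and values here are invented).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("bouwprojecten.csv", "id,name\n1,demo\n")
raw_bytes = buf.getvalue()

# Same pattern as above: wrap the bytes in BytesIO and open as a zip.
with zipfile.ZipFile(io.BytesIO(raw_bytes), "r") as uzf:
    names = uzf.namelist()
    header = uzf.read("bouwprojecten.csv").decode("utf-8").splitlines()[0]

print(names)   # ['bouwprojecten.csv']
print(header)  # id,name
```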
This is an exploration of external databases for some of the datasets, following a discussion started at freeCodeCamp/2017-new-coder-survey#7 by @pdurbin.
- pip: very recent update: https://pypi.python.org/pypi/datadotworld
- pickle file: the Python package apparently doesn't handle formats whose loaders expect `read` or `readline` methods; handling pickle files would require more elaborate code (e.g. https://pypkg.com/pypi/vecshare/f/vecshare/signatures.py). Pickle is very much Python-specific and shouldn't be used, but that means files that were uploaded in different formats, like compressed ones, might not be easily extracted. It was found later that the script above (using `pickle.loads`) would unpickle the pickled file.
- pandas library

For more information about datadotworld and similar services, check the following list: https://docs.google.com/spreadsheets/d/1KptHzDHIdB3s1v5m1mMwphcwXhOVWdkRYdjEWW1dqrE/edit#gid=355072175
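The `read`/`readline` issue mentioned above can be worked around the same way as for the zip: wrap the raw bytes in `io.BytesIO`, which supplies those methods (the payload below is invented for illustration):

```python
import io

raw = b"line one\nline two\n"  # invented payload; in practice, dataset.raw_data[...]
f = io.BytesIO(raw)            # BytesIO gives the bytes read/readline methods

first = f.readline()
rest = f.read()
print(first)  # b'line one\n'
print(rest)   # b'line two\n'
```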