This repository was archived by the owner on Mar 11, 2021. It is now read-only.

TEST: Suggesting the use of external databases: datadotworld #6

@evaristoc


This is an exploration of external databases for some of the datasets, following a discussion started at freeCodeCamp/2017-new-coder-survey#7 by @pdurbin.

A demo exercise is being built with my personal data. So far:

  • https://data.world has an API for extracting data from different applications and languages; Python is the one explored here
  • Open source project: https://github.com/datadotworld
  • A public project with a pickle-based dataset was created
  • The API works per account: a single token manages the whole account, although projects can be either public or private. Is there only one token for read/write, and apparently the same one for admin? (not checked)
  • There is a very recently updated Python package available through pip: https://pypi.python.org/pypi/datadotworld
  • The token is configured in a simple hidden folder in the home directory and is required to communicate with datadotworld. Do only users registered with datadotworld get a token? (not checked)
  • The exercise used a pickle file. The Python package apparently has no commands for formats that lack read or readline methods, so handling pickle files would require more elaborate code (e.g. https://pypkg.com/pypi/vecshare/f/vecshare/signatures.py). Pickle is very Python-specific and shouldn't be used, but this also means that files uploaded in other formats, such as compressed ones, might not be easy to extract
    It was later found that the following script unpickles the pickled file:
import datadotworld

# dataset = datadotworld.load_dataset('https://data.world/ectest123/survey-2016')
# Note the name in the URL: I renamed the project to "Amphibians", but the URL was not updated!

# The API seems to read one file per project and several per dataset. The URL does not
# distinguish between the two, so the owner must know which is which.
dataset = datadotworld.load_dataset('https://data.world/ectest123/testdatasets')

dataset.describe()  # describes the "dataset", which is actually the project
# Output was:
# {'title': 'TestDatasets',
#  'resources': [
#      {'path': 'data/bouwprojecten.csv', 'name': 'bouwprojecten', 'format': 'csv'},
#      {'format': 'pkl', 'path': 'original/allamphibians.pkl', 'name': 'original/allamphibians.pkl',
#       'mediatype': 'application/octet-stream', 'bytes': 269171},
#      {'format': 'csv', 'path': 'original/bouwprojecten.csv', 'name': 'original/bouwprojecten.csv',
#       'mediatype': 'text/csv', 'bytes': 143452},
#      {'format': 'zip', 'path': 'original/bouwprojecten.zip', 'name': 'original/bouwprojecten.zip',
#       'mediatype': 'application/zip', 'bytes': 18608}],
#  'name': 'ectest123_testdatasets',
#  'homepage': 'https://data.world/ectest123/testdatasets'}

# The pickled file only shows up in raw_data, because it is binary
for f in [dataset.dataframes, dataset.tables, dataset.raw_data]:
    print(f)

# Output was:
# {'bouwprojecten': LazyLoadedValue(<pandas.DataFrame>)}
# {'bouwprojecten': LazyLoadedValue(<list of rows>)}
# {'original/bouwprojecten.zip': LazyLoadedValue(<bytes>),
#  'bouwprojecten': LazyLoadedValue(<bytes>),
#  'original/allamphibians.pkl': LazyLoadedValue(<bytes>),
#  'original/bouwprojecten.csv': LazyLoadedValue(<bytes>)}

### working on the pickle file
import pickle

# raw_data holds bytes, so use `pickle.loads` (takes bytes), not `pickle.load` (takes a file object)
unpickled = pickle.loads(dataset.raw_data['original/allamphibians.pkl'])
# unpickled is my file!

### working on the zipfile
# check the following references:
# --- https://stackoverflow.com/questions/9887174/how-to-convert-bytearray-into-a-zip-file
# --- https://docs.python.org/3/library/io.html
# --- http://code.activestate.com/recipes/52265-read-data-from-zip-files/

import zipfile
import io

f = io.BytesIO(dataset.raw_data['original/bouwprojecten.zip'])
uzf = zipfile.ZipFile(f, "r")
uzf.namelist() 
# output => ['bouwprojecten.csv']
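
Going one step past `namelist()`, the CSV inside the archive can probably be read without writing anything to disk. The sketch below is self-contained and does not call datadotworld: it builds a small in-memory zip as a stand-in for `dataset.raw_data['original/bouwprojecten.zip']`. The helper name `read_csv_from_zip_bytes`, the sample rows, and the UTF-8 encoding are assumptions made for illustration, not something checked against the real file.

```python
import csv
import io
import zipfile


def read_csv_from_zip_bytes(zip_bytes, member, encoding="utf-8"):
    """Return the rows of a CSV file stored inside an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as archive:
        with archive.open(member) as binary_file:
            # ZipFile.open yields a binary stream; wrap it so csv can read text
            text_file = io.TextIOWrapper(binary_file, encoding=encoding, newline="")
            return list(csv.reader(text_file))


# Stand-in for dataset.raw_data['original/bouwprojecten.zip'] (made-up content)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("bouwprojecten.csv", "id,name\n1,demo\n")
zip_bytes = buffer.getvalue()

rows = read_csv_from_zip_bytes(zip_bytes, "bouwprojecten.csv")
print(rows)
```

With the real dataset, `zip_bytes` would be replaced by the bytes from `raw_data`; the encoding may need adjusting if the CSV is not UTF-8.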


For more information about datadotworld and similar services, see the following list: https://docs.google.com/spreadsheets/d/1KptHzDHIdB3s1v5m1mMwphcwXhOVWdkRYdjEWW1dqrE/edit#gid=355072175
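
On the `loads` versus `load` remark in the script: the difference is what each function accepts. `pickle.loads` takes a bytes object, while `pickle.load` takes a file-like object that has `read` and `readline` methods. A minimal sketch with made-up data (the `sample` dict is only a stand-in for the real pickle contents):

```python
import io
import pickle

sample = {"species": ["frog", "toad"], "count": 2}  # made-up stand-in data
raw = pickle.dumps(sample)  # bytes, the same kind of value raw_data holds

from_bytes = pickle.loads(raw)            # bytes in, object out
from_file = pickle.load(io.BytesIO(raw))  # file-like object in, object out

print(from_bytes == from_file == sample)
```

So if only `load` is at hand, wrapping the bytes in `io.BytesIO` should work as well. Either way, only unpickle data from a source you trust, since unpickling can execute arbitrary code.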
