TEST: Suggesting the use of external databases: datadotworld

This is an exploration of external databases for some of the datasets, following a discussion started at https://github.com/freeCodeCamp/2017-new-coder-survey/issues/7 by @pdurbin.

A demo exercise is being built with my personal data. So far:
* https://data.world has an API to extract data from different applications and languages; Python is explored here
* Open source project: https://github.com/datadotworld
* A public project with a pickle-based dataset was created
* The API works per account: a single token manages the whole account, although projects can be either public or private. Is there only one token for read/write, and is it the same one for admin? (not checked)
* There is a recently updated Python package installable with `pip`: https://pypi.python.org/pypi/datadotworld
* The token is configured in a hidden folder in the home directory and is required to communicate with datadotworld. Do only users registered in datadotworld get a token? (not checked)
* The exercise used a `pickle` file. ~The Python package apparently doesn't provide commands for formats that lack `read` or `readline` methods; handling pickle files would require more elaborate code (e.g. https://pypkg.com/pypi/vecshare/f/vecshare/signatures.py). pickle is very Python-specific and shouldn't be used, but that means files that were loaded in different formats, like compressed ones, might not be easily extracted.~
It was later found that the following script unpickles the pickled file:
```
import pickle  # needed for pickle.loads further down

import datadotworld

# dataset = datadotworld.load_dataset('https://data.world/ectest123/survey-2016')
# Note the name in the URL: I renamed the project to "Amphibians", but the URL was not updated!

# The API seems to read one file per project and several for datasets; the URL does not
# distinguish between the two, so the owner must know which one it is.
dataset = datadotworld.load_dataset('https://data.world/ectest123/testdatasets')

dataset.describe() # to get a description of the "dataset", which is actually the project
# output was:
#{'title': 'TestDatasets',
# 'resources': [
#   {'path': 'data/bouwprojecten.csv', 'name': 'bouwprojecten', 'format': 'csv'},
#   {'format': 'pkl', 'path': 'original/allamphibians.pkl', 'name': 'original/allamphibians.pkl', 'mediatype': 'application/octet-stream', 'bytes': 269171},
#   {'format': 'csv', 'path': 'original/bouwprojecten.csv', 'name': 'original/bouwprojecten.csv', 'mediatype': 'text/csv', 'bytes': 143452},
#   {'format': 'zip', 'path': 'original/bouwprojecten.zip', 'name': 'original/bouwprojecten.zip', 'mediatype': 'application/zip', 'bytes': 18608}],
# 'name': 'ectest123_testdatasets',
# 'homepage': 'https://data.world/ectest123/testdatasets'}

# Pickled files are binary, so they are listed only under raw_data
for f in [dataset.dataframes, dataset.tables, dataset.raw_data]:
    print(f)

#output was:
#{'bouwprojecten': LazyLoadedValue(<pandas.DataFrame>)}
#{'bouwprojecten': LazyLoadedValue(<list of rows>)}
#{'original/bouwprojecten.zip': LazyLoadedValue(<bytes>), 'bouwprojecten': LazyLoadedValue(<bytes>),
# 'original/allamphibians.pkl': LazyLoadedValue(<bytes>), 'original/bouwprojecten.csv': LazyLoadedValue(<bytes>)}

### working on the pickle file

# Use `loads` (takes a bytes object), not `load` (takes a file object)
unpickled = pickle.loads(dataset.raw_data['original/allamphibians.pkl'])
# unpickled is my file!

### working on the zipfile
# check the following references:
# --- https://stackoverflow.com/questions/9887174/how-to-convert-bytearray-into-a-zip-file
# --- https://docs.python.org/3/library/io.html
# --- http://code.activestate.com/recipes/52265-read-data-from-zip-files/

import zipfile
import io

f = io.BytesIO(dataset.raw_data['original/bouwprojecten.zip'])
uzf = zipfile.ZipFile(f, "r")
uzf.namelist() 
# output => ['bouwprojecten.csv']


```
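
The two byte-handling steps in the script can be tried without a datadotworld account. The following is only a minimal sketch: `pkl_bytes` and `zip_bytes` are hypothetical stand-ins, built in memory, for the values that `dataset.raw_data[...]` returned above, and the sample rows are made up for illustration.

```python
import io
import pickle
import zipfile

# Hypothetical stand-ins for the bytes returned by dataset.raw_data[...],
# built in memory so the example runs without a data.world token.
records = [{"id": 1, "name": "alpha"}, {"id": 2, "name": "beta"}]
pkl_bytes = pickle.dumps(records)

zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("bouwprojecten.csv", "id,name\n1,alpha\n2,beta\n")
zip_bytes = zip_buffer.getvalue()

# pickle.loads accepts a bytes object directly...
from_loads = pickle.loads(pkl_bytes)
# ...while pickle.load needs a file-like object, so the bytes are wrapped first.
from_load = pickle.load(io.BytesIO(pkl_bytes))

# Read a member of the zip archive without writing anything to disk.
with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
    member_names = archive.namelist()
    csv_text = archive.read(member_names[0]).decode("utf-8")

print(from_loads == from_load)
print(member_names)
print(csv_text.splitlines()[0])
```

Only unpickle bytes from a source you trust, since unpickling can execute arbitrary code.
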
* Using Spark and some big data capabilities, the platform offers features to explore and manipulate datasets, including a Workspace
* Loaded `*.csv` files should be comma-separated so that the datadotworld platform capabilities can use them easily; there are other simple restrictions, but they won't affect the file if it is extracted
* There is a free course on DataCamp (https://campus.datacamp.com/courses/intro-to-dataworld-in-python/) showing how to use the datadotworld API for Python in combination with the `pandas` library
* The API can be queried with SQL
* Example of working with Python AND GitHub with datadotworld ---> https://www.dataquest.io/blog/datadotworld-python-tutorial/
* Other capabilities, using SQL and the UI https://data.world/jonloyens/an-intro-to-dataworld-dataset
* Example of projects with Gov organizations (2016) http://www.esa.doc.gov/under-secretary-blog/dataworld-bring-valuable-commerce-datasets-social-network-data-people
* Help is scattered; especially for API capabilities, there are not many examples to be found. NOTE: this is not a relevant aspect, as users would mostly use the API to load and download data
* It has some presence in Medium with its own publication (https://meta.data.world/) as well as in some Data Science related articles
* Only up to 1GB allowed per dataset section (probably using Databricks or similar in the background?)
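
Regarding the comma-separated requirement noted in the list above, a file that uses another delimiter can be rewritten before it is loaded. This is a minimal sketch using only the standard library; the semicolon delimiter and the `to_comma_separated` helper are assumptions for illustration, not something taken from the datadotworld package.

```python
import csv
import io


def to_comma_separated(text, delimiter=";"):
    """Rewrite delimited text as comma-separated CSV text.

    The semicolon default is an assumption; pass the delimiter
    the source file actually uses.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    # Fields that contain a comma are quoted automatically by csv.writer.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()


sample = "id;name;note\n1;alpha;a,b\n"
converted = to_comma_separated(sample)
print(converted)
```

Going through the `csv` module rather than a plain string replace keeps fields that already contain commas intact.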

For more information about datadotworld and similar services, check the following list: https://docs.google.com/spreadsheets/d/1KptHzDHIdB3s1v5m1mMwphcwXhOVWdkRYdjEWW1dqrE/edit#gid=355072175
