Read/write large datasets in JupyterLab environment #1

@nelsonni

Description

Repository mining often requires trial and error to properly answer complex data-driven questions. When working with large datasets (e.g., large repositories with a long development history), pulling data down from the GitHub API can take time. Being able to cache or save these large datasets to/from files eliminates that delay and reduces the risk of hitting the GitHub API rate limit.

The naïve approach would be to use the standard Python 3 built-in input/output functions and the data encoding/decoding capabilities of the json library. For example:

import json

# write the data to a file, so we don't have to re-pull it from the API
with open("pulls.json", "w") as f:
    f.write(json.dumps(pulls))

# if the data has already been extracted, simply load it back into the environment
with open("pulls.json", "r") as f:
    pulls = json.loads(f.read())

However, for larger datasets this solution throws the following errors when executed in a JupyterLab notebook environment:

IOStream.flush timed out
[W 13:37:54.357 LabApp] IOPub data rate exceeded.
    The notebook server will temporarily stop sending output
    to the client in order to avoid crashing it.
    To change this limit, set the config variable
    `--NotebookApp.iopub_data_rate_limit`.
    
    Current values:
    NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
    NotebookApp.rate_limit_window=3.0 (secs)
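
One possible workaround, as the error message itself suggests, is to raise the `--NotebookApp.iopub_data_rate_limit` value when starting the server. Another option, sketched below under the assumption that `pulls` holds the same pull-request data as in the snippet above, is to stream the (de)serialization through the file handle with json.dump/json.load instead of materializing the entire JSON string in memory; whether this sidesteps the IOPub limit for a given dataset is untested here.

import json

# Sketch: serialize directly to the file handle rather than building one
# large string in memory first. `pulls` is assumed to be the pull-request
# data fetched from the GitHub API, as in the snippet above.
with open("pulls.json", "w") as f:
    json.dump(pulls, f)

# read the cached data back by parsing straight from the file handle
with open("pulls.json", "r") as f:
    pulls = json.load(f)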

Labels

bug (Something isn't working)
