Read/write large datasets in JupyterLab environment #1

@nelsonni

Description

Repository mining often requires trial and error to properly answer complex data-driven questions. When working with large datasets (e.g., large repositories with a long development history), pulling data down from the GitHub API can take time. Being able to cache or save these large datasets to/from files eliminates that delay and reduces the risk of hitting the GitHub API rate limit.

The naïve approach would be to use the standard Python 3 built-in input/output functions and the data encoding/decoding capabilities of the json library. For example:

import json

# write the data to a file, so we don't have to re-pull it from the API
with open("pulls.json", "w") as f:
    f.write(json.dumps(pulls))

# if the data has already been extracted, simply load it back into the environment
with open("pulls.json", "r") as f:
    pulls = json.loads(f.read())

However, for larger datasets this solution throws the following errors when executed in a JupyterLab notebook environment:

IOStream.flush timed out
[W 13:37:54.357 LabApp] IOPub data rate exceeded.
    The notebook server will temporarily stop sending output
    to the client in order to avoid crashing it.
    To change this limit, set the config variable
    `--NotebookApp.iopub_data_rate_limit`.
    
    Current values:
    NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
    NotebookApp.rate_limit_window=3.0 (secs)
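
One possible workaround, as the error message itself suggests, is to raise the `--NotebookApp.iopub_data_rate_limit` value when starting the server. Another option, sketched below under the assumption that `pulls` holds the same pull-request data as in the snippet above, is to stream the (de)serialization through the file handle with json.dump/json.load instead of materializing the entire JSON string in memory; whether this sidesteps the IOPub limit for a given dataset is untested here.

import json

# Sketch: serialize directly to the file handle rather than building one
# large string in memory first. `pulls` is assumed to be the pull-request
# data fetched from the GitHub API, as in the snippet above.
with open("pulls.json", "w") as f:
    json.dump(pulls, f)

# read the cached data back by parsing straight from the file handle
with open("pulls.json", "r") as f:
    pulls = json.load(f)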

Labels

bug (Something isn't working)
