Repository mining often requires trial and error to properly answer complex data-driven questions. When working with large datasets (e.g., large repositories with a long development history), pulling data down from the GitHub API can take time. Being able to cache these large datasets to/from files eliminates this delay and removes the risk of hitting the GitHub API rate limit.
The naïve approach is to use Python 3's built-in file I/O functions together with the encoding/decoding capabilities of the standard json library. For example:
```python
import json

# Write the data to a file so we don't have to re-fetch it later.
with open("pulls.json", "w") as f:
    f.write(json.dumps(pulls))

# If the data has already been extracted, simply load it back into the environment.
with open("pulls.json", "r") as f:
    pulls = json.loads(f.read())
```

However, for larger datasets this solution produces the following errors when executed in a JupyterLab notebook environment:
```
IOStream.flush timed out
[W 13:37:54.357 LabApp] IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
```
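A sketch of one possible workaround, assuming part of the overhead comes from materializing the entire JSON string in memory with `json.dumps`/`json.loads`: `json.dump` and `json.load` stream directly to and from the file handle instead. The `save_pulls`/`load_pulls` helpers and the `fetch_pulls_from_github` placeholder are illustrative names, not part of the existing codebase; only `pulls.json` comes from the snippet above.

```python
import json
import os

CACHE_FILE = "pulls.json"  # same file name as in the snippet above

def save_pulls(pulls, path=CACHE_FILE):
    # json.dump serializes directly into the file handle,
    # avoiding the large intermediate string json.dumps builds.
    with open(path, "w") as f:
        json.dump(pulls, f)

def load_pulls(path=CACHE_FILE):
    # json.load decodes straight from the file handle.
    with open(path, "r") as f:
        return json.load(f)

# Hypothetical usage: only hit the GitHub API when no cache exists.
if os.path.exists(CACHE_FILE):
    pulls = load_pulls()
else:
    pulls = fetch_pulls_from_github()  # placeholder for the actual API call
    save_pulls(pulls)
```

If the warning persists, the limit quoted in the message itself can also be raised at launch time (e.g. `jupyter lab --NotebookApp.iopub_data_rate_limit=1e10`), though that treats the symptom rather than the memory cost.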