A python framework and git template for data scientists, teams, and workshop organizers, aimed at making your data science reproducible
For most of us, data science is 5% science, 60% data cleaning, and 35% IT hell. Easydata focuses on the other 95% by helping you deliver
- reproducible python environments,
- reproducible datasets, and
- reproducible workflows
In other words, Easydata is a template, library, and workflow that lets you get up and running with your data science analysis, quickly and reproducibly.
Easydata is a framework for building custom data science git repos that provides:
- a prescribed workflow for collaboration and storytelling,
- a python framework to support this workflow,
- a makefile wrapper for conda and pip environment management,
- prebuilt dataset recipes, and
- a library of training materials and documentation on doing reproducible data science.
Easydata is not
- an ETL toolkit,
- a data analysis pipeline,
- a containerization solution, or
- a prescribed data format.
Requirements:
- anaconda (or miniconda)
- python 3.6+ (we use f-strings; so should you)
- Cookiecutter Python package >= 1.4.0. This can be installed with pip or conda, depending on how you manage your Python packages.
Once you've installed anaconda, you can install the remaining requirements (including cookiecutter) by doing:
conda create -n easydata python=3
conda activate easydata
python -m pip install -r requirements.txt
cookiecutter https://github.com/hackalog/easydata
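The python 3.6+ requirement above exists because Easydata's code uses f-strings; a one-line illustration of the syntax for anyone unfamiliar:

```python
# f-strings (PEP 498, Python 3.6+) interpolate expressions directly
# into string literals at runtime
project = "easydata"
version = 3
message = f"{project} needs python {version}.6+"
assert message == "easydata needs python 3.6+"
```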
The directory structure of your new project looks like this:
- LICENSE - Terms of use for this repo
- Makefile - Top-level makefile. Type `make` for a list of valid commands
- Makefile.include - Global includes for makefile routines. Included by Makefile.
- Makefile.env - Commands for maintaining a reproducible conda environment. Included by Makefile.
- README.md - This file
- catalog - Data catalog. This is where config information such as data sources and data transformations are saved
- catalog/config.ini - Local Data Store. This configuration file is for local data only, and is never checked into the repo.
- data - Data directory. Often symlinked to a filesystem with lots of space
- data/raw - Raw (immutable) hash-verified downloads
- data/interim - Extracted and interim data representations
- data/interim/cache - Dataset cache
- data/processed - The final, canonical data sets for modeling.
- docs - Sphinx-format documentation files for this project.
- docs/Makefile - Makefile for generating HTML/LaTeX/other formats from the Sphinx-format documentation.
- notebooks - Jupyter notebooks. Naming convention is a number (for ordering), the creator's initials, and a short `-`-delimited description, e.g. `1.0-jqp-initial-data-exploration`.
- reference - Data dictionaries, documentation, manuals, scripts, papers, or other explanatory materials.
- reference/easydata - Easydata framework and workflow documentation.
- reference/templates - Templates and code snippets for Jupyter
- reference/dataset - Resources related to datasets; e.g. dataset creation notebooks and scripts
- reports - Generated analysis as HTML, PDF, LaTeX, etc.
- reports/figures - Generated graphics and figures to be used in reporting
- environment.yml - The user-readable YAML file for reproducing the conda/pip environment.
- environment.(platform).lock.yml - Resolved versions, the result of processing environment.yml
- setup.py - Turns the contents of MODULE_NAME into a pip-installable python module (`pip install -e .`) so it can be imported in python code
- MODULE_NAME - Source code for use in this project.
- MODULE_NAME/__init__.py - Makes MODULE_NAME a Python module
- MODULE_NAME/data - Code to fetch raw data and generate Datasets from them
- MODULE_NAME/analysis - Code to turn Datasets into output products
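The notebook naming convention above (number for ordering, creator's initials, short hyphen-delimited description) is easy to generate and check mechanically. A small sketch; the helper and regex below are ours for illustration, not part of the Easydata framework:

```python
import re

def notebook_name(order: str, initials: str, description: str) -> str:
    """Build a notebook filename: <number>-<initials>-<description>.ipynb"""
    slug = description.lower().replace(" ", "-")
    return f"{order}-{initials}-{slug}.ipynb"

name = notebook_name("1.0", "jqp", "initial data exploration")
assert name == "1.0-jqp-initial-data-exploration.ipynb"

# A matching regex, useful for linting existing notebook names
PATTERN = re.compile(r"^\d+(\.\d+)*-[a-z]+-[a-z0-9-]+\.ipynb$")
assert PATTERN.match(name)
```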
The first time:
make create_environment
git init
git add .
git commit -m "initial import"
git branch easydata # tag for future easydata upgrades
Subsequent updates:
make update_environment
In case you need to delete the environment later:
conda deactivate
make delete_environment
- Early versions of Easydata were based on the excellent cookiecutter-data-science template.
- Thanks to the Tutte Institute for supporting the development of this framework.