The purpose of this document is to provide support for the use of version
control (a lá git) in our data analysis and software
development efforts. In order to effectively use the git repository available
to us via the HBGDki collaboration, we will reserve that repository for
(cleaned/curated) data sets that will be used in our analysis and demonstration
code. Rather than using the GHAP git server to version code we produce, we
will generate and use repos under the HBGD-UCB GitHub Organization. Since we
expect all analysis code to rely on the data sets stored in the private GHAP
git repo, there is no need for our analysis code to be held in private
repositories (it ought to be useless to those unaffiliated with our work).
The HBGDki git repo is available for cloning (when on a
GHAP session) via
# ssh user.name@IP
# enter password
git clone https://user.name@git.ghap.io/stash/scm/hbgd-teams/UCB-SuperLearner.gitN.B., the GHAP data repo may only be cloned when logged in to a session associated with the GHAP system.
Any given git repo associated with the software development and data analysis
efforts may easily be cloned from the HBGD-UCB organization. For example, to
clone this repository, try:
git clone git@github.com:HBGD-UCB/good-git.gitUnlike the repo on the GHAP git server, all repos affiliated with our GitHub
organization ought to be clone-able from any system (provided your GitHub
credentials are set up properly) -- this means they ought to be accessible when
logged in to a GHAP session or on your local machine.
- When developing code (for data analysis or the software products), work on a
new branch that is aptly named with respect to the functionality you hope to
add with your work. Commit often, and once you've completed your work,
generate a pull
request to the
masterordevelopbranch of the target repo as appropriate. - We will be using the "
gitflow" branching model. This will help keep us on the same page about how and when to create branches as well as how to merge new additions to existing long-lasting branches (e.g.,develop).
The most common bad practice in using git is to store your version controlled
repos in a system that provides automatic backups (e.g., Dropbox). DON'T DO
THIS -- especially for repos on which you expect collaborators to contribute.
Why? Well, git works by storing snapshots (of the changes in a given
plain-text file between commits) while Dropbox makes near-constant backups. This
has the potential to lead to a conflicting HEAD, a problem that no one wants
to resolve.
tl;dr -- please, please don't use Dropbox with the git repos.
Here are a few useful notes and readings that may be useful in remedying any
problems that may arise when working with git. These range the spectrum from
applied to fairly technical:
- "Version Control with
git" (Software Carpentry) - a comprehensive introduction to the uses ofgitand social coding with GitHub. This is aimed towards students and research professionals. - "Happy
gitand GitHub for the useR" (Jenny Bryan, RStudio) - a comprehensive introduction to both the inner workings ofgitand best practices in using GitHub, with a focus on integrating these tools into your R workflow -- aimed towards masters-level university students. - "Tools for Reproducible Research" (Karl
Broman) - a University-level course on best
practices in modern reproducible research, which includes a couple of
lectures on
gitand GitHub. - "Introduction to
git" (Berkeley's Stat 159/259, Fall 2015, KJ Millman) - a thorough walkthrough of some commongitcommands as well as explanations of what these commands do when invoked. - "Elementary
gitwith GitHub" (Nima Hejazi) - some digest-style notes made after reading far more thorough introductions togit. These are often useful to me when I need to look up a specific bit of core functionality.