From 5e4c5fb8771e2ff4d5c977a11b8d1e5f73bd1fa7 Mon Sep 17 00:00:00 2001
From: imhardikj <imhardikj@gmail.com>
Date: Tue, 1 Sep 2020 17:07:15 +0530
Subject: [PATCH 1/4] Best practices guide

---
 content/docs/user-guide/best-practices.md  | 131 +++++++++++++++++++++
 content/docs/user-guide/tips-and-tricks.md |  17 +++
 2 files changed, 148 insertions(+)
 create mode 100644 content/docs/user-guide/best-practices.md
 create mode 100644 content/docs/user-guide/tips-and-tricks.md

diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md
new file mode 100644
index 0000000000..ea88597044
--- /dev/null
+++ b/content/docs/user-guide/best-practices.md
@@ -0,0 +1,131 @@
+# Best Practices for DVC Projects
+
+DVC provides a systematic approach towards managing and collaborating on data
+science projects. You can manage your projects with DVC more efficiently using
+the practices listed here:
+
+## Source code and data versioning
+
+You can use DVC to avoid discrepancies between
+[revisions](https://git-scm.com/docs/revisions) of source code and
+[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC
+replaces all large data files, models, etc. with small
+[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
+files point to the original data, which you can access by first checking out the
+required `revision` using Git followed by `dvc checkout` to update DVC tracked
+data files/dir:
+
+```dvc
+$ git checkout 95485f   # Git commit of required data version
+$ dvc checkout
+```
+
+If your dataset consist of multiple files like images, etc. then the best way to
+track whole directory is with single `.dvc` file. You can use `dvc add` with
+relative path to directory:
+
+```dvc
+$ dvc add data/images
+```
+
+## Experiments and tracking parameters
+
+You can use DVC for tuning [parameters](doc/command-reference/params), improving
+target [metrics](doc/command-reference/metrics) and visualizing the changes with
+[plots](doc/command-reference/plots). In the first step tune parameters in
+default `params.yaml` file and reproduce the pipeline:
+
+```dvc
+$ dvc repro        # Reproducing pipeline
+$ git add -am "Epoch Experiment"
+```
+
+Commit the new changes in files using Git. Next step is to compare the
+experiments. Use `dvc metrics` to find difference in target metric between two
+commits:
+
+```dvc
+$ dvc metrics diff rev1 rev2
+```
+
+And finally you can plot target metrics using `dvc plots`:
+
+```dvc
+$ dvc plots diff -x recall -y precision rev1 rev2
+```
+
+If you want to recover a model from last week without wasting time required for
+the model to retrain you can use DVC to navigate through your experiments. First
+you can checkout the required `revision` using Git:
+
+```dvc
+$ git checkout baseline-experiment   # Git commit, tag or branch
+$ dvc checkout
+```
+
+Followed by `dvc checkout` to update DVC-tracked files and directories in your
+workspace.
+
+## Reproducibility
+
+You can run a model's evaluation process again without actually retraining the
+model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines
+partially. You can use `dvc repro` to execute evaluation stage without
+reproducing complete pipeline:
+
+```dvc
+$ dvc repro evaluate
+```
+
+## Managing and sharing large data files
+
+Cloud or local storage can be used to store the project's data. You can share
+the entire 147 GB of your ML project, with all of its data sources, intermediate
+data files, and models with others if they are stored on
+[remote storage](doc/command-reference/remote/add#supported-storage-types).
+Using this you can share models trained in a GPU environment with colleagues who
+don't have access to a GPU. Have a look at this
+[example](doc/command-reference/pull#example-download-from-specific-remote-storage)
+to see how this works.
+
+## Manually editing dvc.yaml or .dvc files
+
+It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
+
+```yaml
+stages:
+  prepare:
+    cmd: python src/prepare.py data/data.xml
+    deps:
+      - data/data.xml
+    params:
+      - prepare.split
+    outs:
+      - data/prepared
+```
+
+You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
+files please remember not to change the `md5` or `checksum` fields as they
+contain hash values which DVC uses to track the file or directory.
+
+## Never store credentials in project config
+
+Do not store any user credentials in project config file. This file can be found
+by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command
+options with `dvc config` for storing sensitive, or user-specific settings:
+
+```dvc
+$ dvc config --system remote.username [password]
+```
+
+## Tracking <abbr>outputs</abbr> by Git
+
+If your `output` files are small in size and you want to track them with Git
+then you can use `--outs-no-cache` option to define outputs while creating or
+modifying a stage. DVC will not track will not track outputs in this case:
+
+```dvc
+$ dvc run -n train -d src/train.py -d data/features \
+          ---outs-no-cache model.p \
+          python src/train.py data/features model.pkl
+```
diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md
new file mode 100644
index 0000000000..8bad7f053e
--- /dev/null
+++ b/content/docs/user-guide/tips-and-tricks.md
@@ -0,0 +1,17 @@
+# Tips and tricks for DVC Projects
+
+This guide provides general tips and tricks related to DVC, which can be
+utilized while working on a project. Using the practices listed here, you can
+manage your projects with DVC more efficiently.
+
+## Using meta in dvc.yaml or .dvc files
+
+DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
+used to add any user specific information. It also supports YAML content.
+
+## Switching between datasets
+
+You can quickly switch between a large dataset and a small subset without
+modifying source code. To achieve this you need to change dependencies of
+relevant stage either by using `dvc run` with the `-f` option or by manually
+editing the stage in `dvc.yaml` file.

From 887f2c1e3917952448b45809938b3db0859fef0b Mon Sep 17 00:00:00 2001
From: imhardikj <imhardikj@gmail.com>
Date: Tue, 1 Sep 2020 18:41:07 +0530
Subject: [PATCH 2/4] updates

---
 content/docs/sidebar.json                 | 1 +
 content/docs/user-guide/best-practices.md | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index de77dfde71..9f18c745e3 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -91,6 +91,7 @@
         "label": "DVC Files and Directories",
         "slug": "dvc-files-and-directories"
       },
+      "best-practices",
       "merge-conflicts",
       {
         "slug": "dvcignore",
diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md
index ea88597044..a2f99ef0c9 100644
--- a/content/docs/user-guide/best-practices.md
+++ b/content/docs/user-guide/best-practices.md
@@ -66,6 +66,10 @@ $ dvc checkout
 Followed by `dvc checkout` to update DVC-tracked files and directories in your
 workspace.
 
+If you are training different models on your data files in the same project,
+using Git commits, tags, or branches makes it easy to manage the project. Have a
+look at this [example]() to see how this works.
+
 ## Reproducibility
 
 You can run a model's evaluation process again without actually retraining the

From c30116b384ce36ed66812287f17aaadea3b9c7f9 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel <jorge@orpinel.com>
Date: Sun, 20 Sep 2020 02:47:50 -0400
Subject: [PATCH 3/4] guide: review Best Practices and tips&tricks so far...

---
 content/docs/user-guide/best-practices.md  | 170 ++++++++++-----------
 content/docs/user-guide/tips-and-tricks.md |  45 ++++--
 2 files changed, 118 insertions(+), 97 deletions(-)

diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md
index a2f99ef0c9..bbd105ec33 100644
--- a/content/docs/user-guide/best-practices.md
+++ b/content/docs/user-guide/best-practices.md
@@ -1,135 +1,133 @@
 # Best Practices for DVC Projects
 
 DVC provides a systematic approach towards managing and collaborating on data
-science projects. You can manage your projects with DVC more efficiently using
-the practices listed here:
+science projects. Here are a few recommended practices to organize your workflow
+and project structure effectively:
 
-## Source code and data versioning
+> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks).
 
-You can use DVC to avoid discrepancies between
+## Matching source code to data
+
+One of DVC's basic uses is to avoid a disconnection between
 [revisions](https://git-scm.com/docs/revisions) of source code and
-[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC
-replaces all large data files, models, etc. with small
-[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
-files point to the original data, which you can access by first checking out the
-required `revision` using Git followed by `dvc checkout` to update DVC tracked
-data files/dir:
+[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces
+large data files and directories, models, etc. with small
+[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with
+Git, along with the corresponding code.
+
+These metafiles point to the original data, which is <abbr>cached</abbr>
+automatically. You can restore it later by checking out the corresponding Git
+revision (e.g. with `git checkout`) and then running `dvc checkout` to update
+the DVC-tracked files and directories:
 
 ```dvc
-$ git checkout 95485f   # Git commit of required data version
+$ git checkout 95485f  # Git commit of a desired project version
 $ dvc checkout
 ```
 
+> See
+> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files)
+> for more details.
+
+## Using directories as single data units
+
 If your dataset consist of multiple files like images, etc. then the best way to
-track whole directory is with single `.dvc` file. You can use `dvc add` with
-relative path to directory:
+track it is
+[as a directory](/doc/command-reference/add#adding-entire-directories), with a
+single `.dvc` file:
 
 ```dvc
-$ dvc add data/images
+$ dvc add data/images/
 ```
 
-## Experiments and tracking parameters
+## Manually editing dvc.yaml or .dvc files
 
-You can use DVC for tuning [parameters](doc/command-reference/params), improving
-target [metrics](doc/command-reference/metrics) and visualizing the changes with
-[plots](doc/command-reference/plots). In the first step tune parameters in
-default `params.yaml` file and reproduce the pipeline:
+It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
 
-```dvc
-$ dvc repro        # Reproducing pipeline
-$ git add -am "Epoch Experiment"
+```yaml
+stages:
+  prepare:
+    cmd: python src/prepare.py data/data.xml
+    deps:
+      - data/data.xml
+    params:
+      - prepare.split
+    outs:
+      - data/prepared
 ```
 
-Commit the new changes in files using Git. Next step is to compare the
-experiments. Use `dvc metrics` to find difference in target metric between two
-commits:
+You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
+files please remember not to change the `md5` or `checksum` fields as they
+contain hash values which DVC uses to track the file or directory.
 
-```dvc
-$ dvc metrics diff rev1 rev2
-```
+## Managing and sharing large data
 
-And finally you can plot target metrics using `dvc plots`:
+Traditional or cloud storage can be used to store the project's data. You can
+share the entire 147 GB of your ML project, with all of its data sources,
+intermediate data files, and models with others by setting up DVC
+[remote storage](/doc/command-reference/remote) (optional).
 
-```dvc
-$ dvc plots diff -x recall -y precision rev1 rev2
-```
+This way you can share models trained in a GPU environment with colleagues who
+don't have access to GPUs.
+
+## Never store secrets in the shared config file
 
-If you want to recover a model from last week without wasting time required for
-the model to retrain you can use DVC to navigate through your experiments. First
-you can checkout the required `revision` using Git:
+Do not put user credentials in the default config file (`.dvc/config`), which is
+tracked by Git. Use the `--local`, `--global`, or `--system` options of
+`dvc config` to provide sensitive or user-specific settings:
 
 ```dvc
-$ git checkout baseline-experiment   # Git commit, tag or branch
-$ dvc checkout
+$ dvc config --local remote.myremote.password mypassword  # just here
+$ dvc config --global core.checksum_jobs 16      # all my projects
+$ dvc config --system core.check_update false    # all users
 ```
 
-Followed by `dvc checkout` to update DVC-tracked files and directories in your
-workspace.
+## Tracking experiments with Git
 
 If you are training different models on your data files in the same project,
-using Git commits, tags, or branches makes it easy to manage the project. Have a
-look at this [example]() to see how this works.
+using Git commits, tags, or branches makes it easy to manage the project.
 
-## Reproducibility
+<!-- TODO: needs much elaboration! -->
 
-You can run a model's evaluation process again without actually retraining the
-model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines
-partially. You can use `dvc repro` to execute evaluation stage without
-reproducing complete pipeline:
+## Basic experimentation flow
 
-```dvc
-$ dvc repro evaluate
-```
+Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning
+their [parameters](/doc/command-reference/params), tracking resulting
+[metrics](/doc/command-reference/metrics), and visualizing their evolution with
+[plots](/doc/command-reference/plots).
 
-## Managing and sharing large data files
+For example, let's first setup some parameters in `params.yaml` and reproduce
+the pipeline:
 
-Cloud or local storage can be used to store the project's data. You can share
-the entire 147 GB of your ML project, with all of its data sources, intermediate
-data files, and models with others if they are stored on
-[remote storage](doc/command-reference/remote/add#supported-storage-types).
-Using this you can share models trained in a GPU environment with colleagues who
-don't have access to a GPU. Have a look at this
-[example](doc/command-reference/pull#example-download-from-specific-remote-storage)
-to see how this works.
+<!-- TODO: sample params file -->
 
-## Manually editing dvc.yaml or .dvc files
+```dvc
+$ dvc repro
+```
 
-It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
+<!-- TODO: what about the command output above? -->
 
-```yaml
-stages:
-  prepare:
-    cmd: python src/prepare.py data/data.xml
-    deps:
-      - data/data.xml
-    params:
-      - prepare.split
-    outs:
-      - data/prepared
-```
+Commit the changes using Git. Having some commits allows us to compare the
+experiments using `dvc metrics diff`:
 
-You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
-files please remember not to change the `md5` or `checksum` fields as they
-contain hash values which DVC uses to track the file or directory.
+```dvc
+$ dvc metrics diff rev1 rev2
+```
 
-## Never store credentials in project config
+<!-- TODO: command output above? -->
 
-Do not store any user credentials in project config file. This file can be found
-by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command
-options with `dvc config` for storing sensitive, or user-specific settings:
+Finally, you can see how certain metrics evolved using `dvc plots diff`:
 
 ```dvc
-$ dvc config --system remote.username [password]
+$ dvc plots diff -x recall -y precision rev1 rev2
 ```
 
-## Tracking <abbr>outputs</abbr> by Git
+<!-- TODO: insert plot img -->
 
-If your `output` files are small in size and you want to track them with Git
-then you can use `--outs-no-cache` option to define outputs while creating or
-modifying a stage. DVC will not track will not track outputs in this case:
+If you want to recover a model from last week without wasting time required to
+retrain the model, you can use Git and DVC to navigate through your experiments:
 
 ```dvc
-$ dvc run -n train -d src/train.py -d data/features \
-          ---outs-no-cache model.p \
-          python src/train.py data/features model.pkl
+$ git checkout baseline-experiment   # Git commit, tag or branch
+$ dvc checkout
 ```
diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md
index 8bad7f053e..a6cefa003a 100644
--- a/content/docs/user-guide/tips-and-tricks.md
+++ b/content/docs/user-guide/tips-and-tricks.md
@@ -1,17 +1,40 @@
 # Tips and tricks for DVC Projects
 
-This guide provides general tips and tricks related to DVC, which can be
-utilized while working on a project. Using the practices listed here, you can
-manage your projects with DVC more efficiently.
-
-## Using meta in dvc.yaml or .dvc files
-
-DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
-used to add any user specific information. It also supports YAML content.
+Using the methods listed here, you can manage your DVC projects more
+efficiently.
 
 ## Switching between datasets
 
 You can quickly switch between a large dataset and a small subset without
-modifying source code. To achieve this you need to change dependencies of
-relevant stage either by using `dvc run` with the `-f` option or by manually
-editing the stage in `dvc.yaml` file.
+modifying source code: change the dependencies of the relevant stage, either by
+manually editing it in `dvc.yaml` or by running `dvc run` again with `-f`.
+
+<!-- TODO: needs actual example -->
+
+## Tracking small data with Git
+
+If your output files are small and you want to track them with Git, use the
+`--outs-no-cache` option to define them when creating or modifying a stage. DVC
+will not cache these outputs, so they can be committed to Git directly:
+
+```dvc
+$ dvc run -n train -d src/train.py -d data/features \
+          --outs-no-cache model.pkl \
+          python src/train.py data/features model.pkl
+```
+
+## Partial reproducibility
+
+You can run a model's evaluation process again without preprocessing a raw
+dataset again, or retraining the model. Pass a target stage to `dvc repro` to
+execute only the necessary parts of the pipeline:
+
+```dvc
+$ dvc repro evaluate
+```
+
+## User metadata in DVC metafiles
+
+DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles
+(that's very meta!). It can hold any user-defined information, written as YAML
+content (e.g. `"a string"`).

From 6d61cc404580965e2de9ac055ff49d386bd9dd71 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel <jorge@orpinel.com>
Date: Sun, 20 Sep 2020 02:57:13 -0400
Subject: [PATCH 4/4] guide: remove experimentation flow section (too
 incomplete) from Best Practices

---
 content/docs/user-guide/best-practices.md | 45 -----------------------
 1 file changed, 45 deletions(-)

diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md
index bbd105ec33..8e3923fae3 100644
--- a/content/docs/user-guide/best-practices.md
+++ b/content/docs/user-guide/best-practices.md
@@ -86,48 +86,3 @@ $ dvc config --system core.check_update false    # all users
 
 If you are training different models on your data files in the same project,
 using Git commits, tags, or branches makes it easy to manage the project.
-
-<!-- TODO: needs much elaboration! -->
-
-## Basic experimentation flow
-
-Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning
-their [parameters](/doc/command-reference/params), tracking resulting
-[metrics](/doc/command-reference/metrics), and visualizing their evolution with
-[plots](/doc/command-reference/plots).
-
-For example, let's first setup some parameters in `params.yaml` and reproduce
-the pipeline:
-
-<!-- TODO: sample params file -->
-
-```dvc
-$ dvc repro
-```
-
-<!-- TODO: what about the command output above? -->
-
-Commit the changes using Git. Having some commits allows us to compare the
-experiments using `dvc metrics diff`:
-
-```dvc
-$ dvc metrics diff rev1 rev2
-```
-
-<!-- TODO: command output above? -->
-
-Finally, you can see how certain metrics evolved using `dvc plots diff`:
-
-```dvc
-$ dvc plots diff -x recall -y precision rev1 rev2
-```
-
-<!-- TODO: insert plot img -->
-
-If you want to recover a model from last week without wasting time required to
-retrain the model, you can use Git and DVC to navigate through your experiments:
-
-```dvc
-$ git checkout baseline-experiment   # Git commit, tag or branch
-$ dvc checkout
-```