From 5e4c5fb8771e2ff4d5c977a11b8d1e5f73bd1fa7 Mon Sep 17 00:00:00 2001 From: imhardikj Date: Tue, 1 Sep 2020 17:07:15 +0530 Subject: [PATCH 1/4] Best practices guide --- content/docs/user-guide/best-practices.md | 131 +++++++++++++++++++++ content/docs/user-guide/tips-and-tricks.md | 17 +++ 2 files changed, 148 insertions(+) create mode 100644 content/docs/user-guide/best-practices.md create mode 100644 content/docs/user-guide/tips-and-tricks.md diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md new file mode 100644 index 0000000000..ea88597044 --- /dev/null +++ b/content/docs/user-guide/best-practices.md @@ -0,0 +1,131 @@ +# Best Practices for DVC Projects + +DVC provides a systematic approach towards managing and collaborating on data +science projects. You can manage your projects with DVC more efficiently using +the practices listed here: + +## Source code and data versioning + +You can use DVC to avoid discrepancies between +[revisions](https://git-scm.com/docs/revisions) of source code and +[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC +replaces all large data files, models, etc. with small +[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These +files point to the original data, which you can access by first checking out the +required `revision` using Git followed by `dvc checkout` to update DVC tracked +data files/dir: + +```dvc +$ git checkout 95485f # Git commit of required data version +$ dvc checkout +``` + +If your dataset consist of multiple files like images, etc. then the best way to +track whole directory is with single `.dvc` file. 
You can use `dvc add` with +relative path to directory: + +```dvc +$ dvc add data/images +``` + +## Experiments and tracking parameters + +You can use DVC for tuning [parameters](doc/command-reference/params), improving +target [metrics](doc/command-reference/metrics) and visualizing the changes with +[plots](doc/command-reference/plots). In the first step tune parameters in +default `params.yaml` file and reproduce the pipeline: + +```dvc +$ dvc repro # Reproducing pipeline +$ git add -am "Epoch Experiment" +``` + +Commit the new changes in files using Git. Next step is to compare the +experiments. Use `dvc metrics` to find difference in target metric between two +commits: + +```dvc +$ dvc metrics diff rev1 rev2 +``` + +And finally you can plot target metrics using `dvc plots`: + +```dvc +$ dvc plots diff -x recall -y precision rev1 rev2 +``` + +If you want to recover a model from last week without wasting time required for +the model to retrain you can use DVC to navigate through your experiments. First +you can checkout the required `revision` using Git: + +```dvc +$ git checkout baseline-experiment # Git commit, tag or branch +$ dvc checkout +``` + +Followed by `dvc checkout` to update DVC-tracked files and directories in your +workspace. + +## Reproducibility + +You can run a model's evaluation process again without actually retraining the +model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines +partially. You can use `dvc repro` to execute evaluation stage without +reproducing complete pipeline: + +```dvc +$ dvc repro evaluate +``` + +## Managing and sharing large data files + +Cloud or local storage can be used to store the project's data. You can share +the entire 147 GB of your ML project, with all of its data sources, intermediate +data files, and models with others if they are stored on +[remote storage](doc/command-reference/remote/add#supported-storage-types). 
+Using this you can share models trained in a GPU environment with colleagues who +don't have access to a GPU. Have a look at this +[example](doc/command-reference/pull#example-download-from-specific-remote-storage) +to see how this works. + +## Manually editing dvc.yaml or .dvc files + +It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example: + +```yaml +stages: + prepare: + cmd: python src/prepare.py data/data.xml + deps: + - data/data.xml + params: + - prepare.split + outs: + - data/prepared +``` + +You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc` +files please remember not to change the `md5` or `checksum` fields as they +contain hash values which DVC uses to track the file or directory. + +## Never store credentials in project config + +Do not store any user credentials in project config file. This file can be found +by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command +options with `dvc config` for storing sensitive, or user-specific settings: + +```dvc +$ dvc config --system remote.username [password] +``` + +## Tracking outputs by Git + +If your `output` files are small in size and you want to track them with Git +then you can use `--outs-no-cache` option to define outputs while creating or +modifying a stage. DVC will not track will not track outputs in this case: + +```dvc +$ dvc run -n train -d src/train.py -d data/features \ + ---outs-no-cache model.p \ + python src/train.py data/features model.pkl +``` diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md new file mode 100644 index 0000000000..8bad7f053e --- /dev/null +++ b/content/docs/user-guide/tips-and-tricks.md @@ -0,0 +1,17 @@ +# Tips and tricks for DVC Projects + +This guide provides general tips and tricks related to DVC, which can be +utilized while working on a project. Using the practices listed here, you can +manage your projects with DVC more efficiently. 
+ +## Using meta in dvc.yaml or .dvc files + +DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be +used to add any user specific information. It also supports YAML content. + +## Switching between datasets + +You can quickly switch between a large dataset and a small subset without +modifying source code. To achieve this you need to change dependencies of +relevant stage either by using `dvc run` with the `-f` option or by manually +editing the stage in `dvc.yaml` file. From 887f2c1e3917952448b45809938b3db0859fef0b Mon Sep 17 00:00:00 2001 From: imhardikj Date: Tue, 1 Sep 2020 18:41:07 +0530 Subject: [PATCH 2/4] updates --- content/docs/sidebar.json | 1 + content/docs/user-guide/best-practices.md | 4 ++++ 2 files changed, 5 insertions(+) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index de77dfde71..9f18c745e3 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -91,6 +91,7 @@ "label": "DVC Files and Directories", "slug": "dvc-files-and-directories" }, + "best-practices", "merge-conflicts", { "slug": "dvcignore", diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md index ea88597044..a2f99ef0c9 100644 --- a/content/docs/user-guide/best-practices.md +++ b/content/docs/user-guide/best-practices.md @@ -66,6 +66,10 @@ $ dvc checkout Followed by `dvc checkout` to update DVC-tracked files and directories in your workspace. +If you are training different models on your data files in the same project, +using Git commits, tags, or branches makes it easy to manage the project. Have a +look at this [example]() to see how this works. + ## Reproducibility You can run a model's evaluation process again without actually retraining the From c30116b384ce36ed66812287f17aaadea3b9c7f9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 20 Sep 2020 02:47:50 -0400 Subject: [PATCH 3/4] guide: review Best Practices and tips&tricks so far... 
--- content/docs/user-guide/best-practices.md | 170 ++++++++++----------- content/docs/user-guide/tips-and-tricks.md | 45 ++++-- 2 files changed, 118 insertions(+), 97 deletions(-) diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md index a2f99ef0c9..bbd105ec33 100644 --- a/content/docs/user-guide/best-practices.md +++ b/content/docs/user-guide/best-practices.md @@ -1,135 +1,133 @@ # Best Practices for DVC Projects DVC provides a systematic approach towards managing and collaborating on data -science projects. You can manage your projects with DVC more efficiently using -the practices listed here: +science projects. Here are a few recommended practices to organize your workflow +and project structure effectively: -## Source code and data versioning +> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks). -You can use DVC to avoid discrepancies between +## Matching source code to data + +One of DVC's basic uses is to avoid a disconnection between [revisions](https://git-scm.com/docs/revisions) of source code and -[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC -replaces all large data files, models, etc. with small -[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These -files point to the original data, which you can access by first checking out the -required `revision` using Git followed by `dvc checkout` to update DVC tracked -data files/dir: +[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces +large data files and directories, models, etc. with small +[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with +Git, along with the corresponding code. + +These metafiles point to the original data, which is cached +automatically. You can access it later by restoring that Git working tree (e.g. 
+with `git checkout`) and using `dvc checkout` to update DVC tracked data +files/dir: ```dvc -$ git checkout 95485f # Git commit of required data version +$ git checkout 95485f # Git commit of a desired project version $ dvc checkout ``` +> See +> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files) +> for more details. + +## Using directories as single data units + If your dataset consist of multiple files like images, etc. then the best way to -track whole directory is with single `.dvc` file. You can use `dvc add` with -relative path to directory: +track it is +[as a directory](/doc/command-reference/add#adding-entire-directories), with a +single `.dvc` file: ```dvc -$ dvc add data/images +$ dvc add data/images/ ``` -## Experiments and tracking parameters +## Manually editing dvc.yaml or .dvc files -You can use DVC for tuning [parameters](doc/command-reference/params), improving -target [metrics](doc/command-reference/metrics) and visualizing the changes with -[plots](doc/command-reference/plots). In the first step tune parameters in -default `params.yaml` file and reproduce the pipeline: +It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example: -```dvc -$ dvc repro # Reproducing pipeline -$ git add -am "Epoch Experiment" +```yaml +stages: + prepare: + cmd: python src/prepare.py data/data.xml + deps: + - data/data.xml + params: + - prepare.split + outs: + - data/prepared ``` -Commit the new changes in files using Git. Next step is to compare the -experiments. Use `dvc metrics` to find difference in target metric between two -commits: +You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc` +files please remember not to change the `md5` or `checksum` fields as they +contain hash values which DVC uses to track the file or directory. 
-```dvc -$ dvc metrics diff rev1 rev2 -``` +## Managing and sharing large data -And finally you can plot target metrics using `dvc plots`: +Traditional or cloud storage can be used to store the project's data. You can +share the entire 147 GB of your ML project, with all of its data sources, +intermediate data files, and models with others by setting up DVC +[remote storage](/doc/command-reference/remote) (optional). -```dvc -$ dvc plots diff -x recall -y precision rev1 rev2 -``` +This way you can share models trained in a GPU environment with colleagues who +don't have access to GPUs. + +## Never store secrets in the shared config file -If you want to recover a model from last week without wasting time required for -the model to retrain you can use DVC to navigate through your experiments. First -you can checkout the required `revision` using Git: +Do not put user credentials in the default config file (`.dvc/config`), which is +tracked by Git. Use the `--local`, `--global`, or `--system` options of +`dvc config` to provide sensitive or user-specific settings: ```dvc -$ git checkout baseline-experiment # Git commit, tag or branch -$ dvc checkout +$ dvc config --local remote.password mypassword # just here +$ dvc config --global core.checksum_jobs 16 # all my projects +$ dvc config --system core.check_update false # all users ``` -Followed by `dvc checkout` to update DVC-tracked files and directories in your -workspace. +## Tracking experiments with Git If you are training different models on your data files in the same project, -using Git commits, tags, or branches makes it easy to manage the project. Have a -look at this [example]() to see how this works. +using Git commits, tags, or branches makes it easy to manage the project. -## Reproducibility + -You can run a model's evaluation process again without actually retraining the -model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines -partially. 
You can use `dvc repro` to execute evaluation stage without -reproducing complete pipeline: +## Basic experimentation flow -```dvc -$ dvc repro evaluate -``` +Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning +their [parameters](/doc/command-reference/params), tracking resulting +[metrics](/doc/command-reference/metrics), and visualizing their evolution with +[plots](/doc/command-reference/plots). -## Managing and sharing large data files +For example, let's first setup some parameters in `params.yaml` and reproduce +the pipeline: -Cloud or local storage can be used to store the project's data. You can share -the entire 147 GB of your ML project, with all of its data sources, intermediate -data files, and models with others if they are stored on -[remote storage](doc/command-reference/remote/add#supported-storage-types). -Using this you can share models trained in a GPU environment with colleagues who -don't have access to a GPU. Have a look at this -[example](doc/command-reference/pull#example-download-from-specific-remote-storage) -to see how this works. + -## Manually editing dvc.yaml or .dvc files +```dvc +$ dvc repro +``` -It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example: + -```yaml -stages: - prepare: - cmd: python src/prepare.py data/data.xml - deps: - - data/data.xml - params: - - prepare.split - outs: - - data/prepared -``` +Commit the changes using Git. Having some commits allows us to compare the +experiments using `dvc metrics diff`: -You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc` -files please remember not to change the `md5` or `checksum` fields as they -contain hash values which DVC uses to track the file or directory. +```dvc +$ dvc metrics diff rev1 rev2 +``` -## Never store credentials in project config + -Do not store any user credentials in project config file. This file can be found -by default in `.dvc/config`. 
Use `--local`, `--global`, or `--system` command -options with `dvc config` for storing sensitive, or user-specific settings: +Finally, you can see how certain metrics evolved using `dvc plots diff`: ```dvc -$ dvc config --system remote.username [password] +$ dvc plots diff -x recall -y precision rev1 rev2 ``` -## Tracking outputs by Git + -If your `output` files are small in size and you want to track them with Git -then you can use `--outs-no-cache` option to define outputs while creating or -modifying a stage. DVC will not track will not track outputs in this case: +If you want to recover a model from last week without wasting time required to +retrain the model, you can use Git and DVC to navigate through your experiments: ```dvc -$ dvc run -n train -d src/train.py -d data/features \ - ---outs-no-cache model.p \ - python src/train.py data/features model.pkl +$ git checkout baseline-experiment # Git commit, tag or branch +$ dvc checkout ``` diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md index 8bad7f053e..a6cefa003a 100644 --- a/content/docs/user-guide/tips-and-tricks.md +++ b/content/docs/user-guide/tips-and-tricks.md @@ -1,17 +1,40 @@ # Tips and tricks for DVC Projects -This guide provides general tips and tricks related to DVC, which can be -utilized while working on a project. Using the practices listed here, you can -manage your projects with DVC more efficiently. - -## Using meta in dvc.yaml or .dvc files - -DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be -used to add any user specific information. It also supports YAML content. +Using the methods listed here, you can manage your DVC projects more +efficiently. ## Switching between datasets You can quickly switch between a large dataset and a small subset without -modifying source code. 
To achieve this you need to change dependencies of -relevant stage either by using `dvc run` with the `-f` option or by manually -editing the stage in `dvc.yaml` file. +modifying source code: Change the dependencies of the stage, either by manually +editing the stage in `dvc.yaml` or by using `dvc run` again with `-f`. + + + +## Tracking small data with Git + +If your `output` files are small in size and you want to track them with Git +then you can use the `--outs-no-cache` option to define outputs while creating or +modifying a stage. DVC will not cache the outputs in this case: + +```dvc +$ dvc run -n train -d src/train.py -d data/features \ + --outs-no-cache model.pkl \ + python src/train.py data/features model.pkl +``` + +## Partial reproducibility + +You can run a model's evaluation process again without preprocessing a raw +dataset again, or retraining the model. Pass a target stage to `dvc repro` to +execute only the necessary parts of the pipeline: + +```dvc +$ dvc repro evaluate +``` + +## User metadata in DVC metafiles + +DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles +(that's very meta!). It can be used to add any user information (as YAML content +e.g. `"a string"`). 
From 6d61cc404580965e2de9ac055ff49d386bd9dd71 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 20 Sep 2020 02:57:13 -0400 Subject: [PATCH 4/4] guide: remove experimentation flow section (too incomplete) from Best Practices --- content/docs/user-guide/best-practices.md | 45 ----------------------- 1 file changed, 45 deletions(-) diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md index bbd105ec33..8e3923fae3 100644 --- a/content/docs/user-guide/best-practices.md +++ b/content/docs/user-guide/best-practices.md @@ -86,48 +86,3 @@ $ dvc config --system core.check_update false # all users If you are training different models on your data files in the same project, using Git commits, tags, or branches makes it easy to manage the project. - - - -## Basic experimentation flow - -Use DVC for [reproducing](/doc/command-reference/repro) experiments after tuning -their [parameters](/doc/command-reference/params), tracking resulting -[metrics](/doc/command-reference/metrics), and visualizing their evolution with -[plots](/doc/command-reference/plots). - -For example, let's first setup some parameters in `params.yaml` and reproduce -the pipeline: - - - -```dvc -$ dvc repro -``` - - - -Commit the changes using Git. Having some commits allows us to compare the -experiments using `dvc metrics diff`: - -```dvc -$ dvc metrics diff rev1 rev2 -``` - - - -Finally, you can see how certain metrics evolved using `dvc plots diff`: - -```dvc -$ dvc plots diff -x recall -y precision rev1 rev2 -``` - - - -If you want to recover a model from last week without wasting time required to -retrain the model, you can use Git and DVC to navigate through your experiments: - -```dvc -$ git checkout baseline-experiment # Git commit, tag or branch -$ dvc checkout -```