diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index de77dfde71..9f18c745e3 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -91,6 +91,7 @@
         "label": "DVC Files and Directories",
         "slug": "dvc-files-and-directories"
       },
+      "best-practices",
       "merge-conflicts",
       {
         "slug": "dvcignore",
diff --git a/content/docs/user-guide/best-practices.md b/content/docs/user-guide/best-practices.md
new file mode 100644
index 0000000000..8e3923fae3
--- /dev/null
+++ b/content/docs/user-guide/best-practices.md
@@ -0,0 +1,88 @@
+# Best Practices for DVC Projects
+
+DVC provides a systematic approach towards managing and collaborating on data
+science projects. Here are a few recommended practices to organize your workflow
+and project structure effectively:
+
+> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks).
+
+## Matching source code to data
+
+One of DVC's basic uses is to avoid a disconnection between
+[revisions](https://git-scm.com/docs/revisions) of source code and
+[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces
+large data files and directories, models, etc. with small
+[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with
+Git, along with the corresponding code.
+
+These metafiles point to the original data, which is cached
+automatically. You can access it later by restoring that Git working tree (e.g.
+with `git checkout`) and using `dvc checkout` to update DVC-tracked data
+files and directories:
+
+```dvc
+$ git checkout 95485f  # Git commit of a desired project version
+$ dvc checkout
+```
+
+> See
+> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files)
+> for more details.
+
+## Using directories as single data units
+
+If your dataset consists of multiple files, such as images, then the best way to
+track it is
+[as a directory](/doc/command-reference/add#adding-entire-directories), with a
+single `.dvc` file:
+
+```dvc
+$ dvc add data/images/
+```
+
+## Manually editing dvc.yaml or .dvc files
+
+It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
+
+```yaml
+stages:
+  prepare:
+    cmd: python src/prepare.py data/data.xml
+    deps:
+      - data/data.xml
+    params:
+      - prepare.split
+    outs:
+      - data/prepared
+```
+
+You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
+files please remember not to change the `md5` or `checksum` fields as they
+contain hash values which DVC uses to track the file or directory.
+
+## Managing and sharing large data
+
+Traditional or cloud storage can be used to store the project's data. You can
+share the entire 147 GB of your ML project, with all of its data sources,
+intermediate data files, and models with others by setting up DVC
+[remote storage](/doc/command-reference/remote) (optional).
+
+This way you can share models trained in a GPU environment with colleagues who
+don't have access to GPUs.
+
+## Never store secrets in the shared config file
+
+Do not put user credentials in the default config file (`.dvc/config`), which is
+tracked by Git. Use the `--local`, `--global`, or `--system` options of
+`dvc config` to provide sensitive or user-specific settings:
+
+```dvc
+$ dvc config --local remote.password mypassword  # just here
+$ dvc config --global core.checksum_jobs 16      # all my projects
+$ dvc config --system core.check_update false    # all users
+```
+
+## Tracking experiments with Git
+
+If you are training different models on your data files in the same project,
+using Git commits, tags, or branches makes it easy to manage the project.
diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md
new file mode 100644
index 0000000000..a6cefa003a
--- /dev/null
+++ b/content/docs/user-guide/tips-and-tricks.md
@@ -0,0 +1,40 @@
+# Tips and tricks for DVC Projects
+
+Using the methods listed here, you can manage your DVC projects more
+efficiently.
+
+## Switching between datasets
+
+You can quickly switch between a large dataset and a small subset without
+modifying source code: change the dependencies of the stage, either by manually
+editing the stage in `dvc.yaml` or by using `dvc run` again with `-f`.
+
+
+
+## Tracking small data with Git
+
+If your output files are small and you want to track them with Git, you can
+use the `--outs-no-cache` option to define outputs while creating or
+modifying a stage. DVC will not track outputs in this case:
+
+```dvc
+$ dvc run -n train -d src/train.py -d data/features \
+          --outs-no-cache model.pkl \
+          python src/train.py data/features model.pkl
+```
+
+## Partial reproducibility
+
+You can run a model's evaluation process again without preprocessing a raw
+dataset again, or retraining the model. Pass a target stage to `dvc repro` to
+execute only the necessary parts of the pipeline:
+
+```dvc
+$ dvc repro evaluate
+```
+
+## User metadata in DVC metafiles
+
+DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles
+(that's very meta!). It can be used to add any user information (as YAML content
+e.g. `"a string"`).