Closed
1 change: 1 addition & 0 deletions content/docs/sidebar.json
@@ -91,6 +91,7 @@
"label": "DVC Files and Directories",
"slug": "dvc-files-and-directories"
},
"best-practices",
"merge-conflicts",
{
"slug": "dvcignore",
88 changes: 88 additions & 0 deletions content/docs/user-guide/best-practices.md
@@ -0,0 +1,88 @@
# Best Practices for DVC Projects

DVC provides a systematic approach towards managing and collaborating on data
science projects. Here are a few recommended practices to organize your workflow
and project structure effectively:

> See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks).

## Matching source code to data

One of DVC's basic uses is to avoid a disconnection between
[revisions](https://git-scm.com/docs/revisions) of source code and
[versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces
large data files and directories, models, etc. with small
Comment on lines +9 to +14
Contributor


This isn't really a best practice, just an intro to data tracking with DVC. I guess it could stay here... Not sure 🤔

[metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with
Git, along with the corresponding code.

These metafiles point to the original data, which is <abbr>cached</abbr>
automatically. You can access it later by restoring that Git working tree (e.g.
with `git checkout`) and using `dvc checkout` to update the DVC-tracked files
and directories:

```dvc
$ git checkout 95485f # Git commit of a desired project version
$ dvc checkout
```

> See
> [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files)
> for more details.

## Using directories as single data units

If your dataset consists of many files, such as images, the best way to track it
is
[as a directory](/doc/command-reference/add#adding-entire-directories), with a
single `.dvc` file:

```dvc
$ dvc add data/images/
```

## Manually editing dvc.yaml or .dvc files

It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/data.xml
    deps:
      - data/data.xml
    params:
      - prepare.split
    outs:
      - data/prepared
```

You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
files please remember not to change the `md5` or `checksum` fields as they
contain hash values which DVC uses to track the file or directory.

## Managing and sharing large data

Traditional or cloud storage can be used to store the project's data. You can
share the entire 147 GB of your ML project, with all of its data sources,
intermediate data files, and models with others by setting up DVC
[remote storage](/doc/command-reference/remote) (optional).

This way you can share models trained in a GPU environment with colleagues who
don't have access to GPUs.
Comment on lines +63 to +71
Contributor


But what's the best practice?


## Never store secrets in the shared config file

Do not put user credentials in the default config file (`.dvc/config`), which is
tracked by Git. Use the `--local`, `--global`, or `--system` options of
`dvc config` to provide sensitive or user-specific settings:

```dvc
$ dvc config --local remote.password mypassword # just here
$ dvc config --global core.checksum_jobs 16 # all my projects
$ dvc config --system core.check_update false # all users
```

## Tracking experiments with Git

If you are training different models on your data files in the same project,
using Git commits, tags, or branches makes it easy to manage the project.
40 changes: 40 additions & 0 deletions content/docs/user-guide/tips-and-tricks.md
@@ -0,0 +1,40 @@
# Tips and tricks for DVC Projects

Using the methods listed here, you can manage your DVC projects more
efficiently.

## Switching between datasets

You can quickly switch between a large dataset and a small subset without
modifying source code: change the dependencies of a stage, either by manually
editing the stage in `dvc.yaml` or by using `dvc run` again with `-f`.
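
As a rough sketch of the manual edit, a stage could be pointed at a subset by
changing only its dependency path and command argument. The stage name and the
`data/subset` and `data/full` paths here are hypothetical, not taken from a real
project:

```yaml
stages:
  train:
    # Hypothetical stage: only the data path differs between the two setups
    cmd: python src/train.py data/subset model.pkl
    deps:
      - src/train.py
      - data/subset # previously data/full
    outs:
      - model.pkl
```

The training script itself stays unchanged because it receives the data path as
an argument.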

<!-- TODO: needs actual example -->

## Tracking small data with Git

If your output files are small and you want to track them with Git, you can use
the `--outs-no-cache` option to define outputs while creating or modifying a
stage. DVC will not track the outputs in this case:

```dvc
$ dvc run -n train -d src/train.py -d data/features \
          --outs-no-cache model.pkl \
          python src/train.py data/features model.pkl
```
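
For reference, a stage defined this way would look roughly like the following
in `dvc.yaml`, assuming the output is named `model.pkl` as in the training
command above. The same result can be reached by editing the file by hand:

```yaml
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - src/train.py
      - data/features
    outs:
      - model.pkl:
          cache: false # not cached by DVC, so the file can be committed to Git
```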

## Partial reproducibility

You can run a model's evaluation process again without preprocessing a raw
dataset again, or retraining the model. Pass a target stage to `dvc repro` to
execute only the necessary parts of the pipeline:

```dvc
$ dvc repro evaluate
```
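
To illustrate, consider a hypothetical three-stage pipeline like the sketch
below (the stage names, scripts, and paths are assumptions made for this
example). With `evaluate` as the target, DVC only looks at `evaluate` and the
stages it depends on, and skips any of them whose dependencies have not changed:

```yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw data/prepared
    deps:
      - data/raw
    outs:
      - data/prepared
  train:
    cmd: python src/train.py data/prepared model.pkl
    deps:
      - data/prepared
    outs:
      - model.pkl
  evaluate:
    # Depends on the model only, so an unchanged model is not retrained
    cmd: python src/evaluate.py model.pkl scores.json
    deps:
      - src/evaluate.py
      - model.pkl
    outs:
      - scores.json
```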

## User metadata in DVC metafiles

DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles
(that's very meta!). It can hold any user information, written as YAML content
(e.g. `"a string"`).
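
A minimal sketch of a stage carrying user metadata in `dvc.yaml` might look
like this. The keys under `meta` (`owner`, `notes`) are made up for the example,
since the field accepts arbitrary YAML:

```yaml
stages:
  train:
    cmd: python src/train.py data/features model.pkl
    deps:
      - data/features
    outs:
      - model.pkl
    meta:
      # Hypothetical keys; any YAML content is allowed here
      owner: "a string"
      notes:
        - baseline model
        - trained on the full feature set
```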