guide: best-practices section #1748
Closed
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,88 @@ | ||
| # Best Practices for DVC Projects | ||
|
|
||
| DVC provides a systematic approach towards managing and collaborating on data | ||
| science projects. Here are a few recommended practices to organize your workflow | ||
| and project structure effectively: | ||
|
|
||
| > See also these quick [tips & tricks](/doc/user-guide/tips-and-tricks). | ||
|
|
||
| ## Matching source code to data | ||
|
|
||
| One of DVC's basic uses is to avoid a disconnection between | ||
| [revisions](https://git-scm.com/docs/revisions) of source code and | ||
| [versions](/doc/use-cases/versioning-data-and-model-files) of data. DVC replaces | ||
| large data files and directories, models, etc. with small | ||
| [metafiles](/doc/user-guide/dvc-files-and-directories), which you can track with | ||
| Git, along with the corresponding code. | ||
|
|
||
| These metafiles point to the original data, which is <abbr>cached</abbr> | ||
| automatically. You can access it later by restoring the corresponding Git | ||
| revision (e.g. with `git checkout`) and running `dvc checkout` to update the | ||
| DVC-tracked files and directories: | ||
|
|
||
| ```dvc | ||
| $ git checkout 95485f # Git commit of a desired project version | ||
| $ dvc checkout | ||
| ``` | ||
|
|
||
| > See | ||
| > [Versioning Data and Model Files](/doc/use-cases/versioning-data-and-model-files) | ||
| > for more details. | ||
|
|
||
| ## Using directories as single data units | ||
|
|
||
| If your dataset consists of multiple files (images, for example), the best way | ||
| to track it is | ||
| [as a directory](/doc/command-reference/add#adding-entire-directories), with a | ||
| single `.dvc` file: | ||
|
|
||
| ```dvc | ||
| $ dvc add data/images/ | ||
| ``` | ||
|
|
||
| ## Manually editing dvc.yaml or .dvc files | ||
|
|
||
| It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example: | ||
|
|
||
| ```yaml | ||
| stages: | ||
|   prepare: | ||
|     cmd: python src/prepare.py data/data.xml | ||
|     deps: | ||
|       - data/data.xml | ||
|     params: | ||
|       - prepare.split | ||
|     outs: | ||
|       - data/prepared | ||
|
|
||
| You can manually edit all the fields present in `dvc.yaml`. In `.dvc` files, | ||
| however, do not change the `md5` or `checksum` fields: they contain the hash | ||
| values that DVC uses to track the file or directory. | ||
|
|
||
| ## Managing and sharing large data | ||
|
|
||
| You can store the project's data in traditional or cloud storage. By setting up | ||
| DVC [remote storage](/doc/command-reference/remote) (optional), you can share | ||
| the entire 147 GB of your ML project, with all of its data sources, | ||
| intermediate data files, and models, with others. | ||
|
|
||
| This way you can share models trained in a GPU environment with colleagues who | ||
| don't have access to GPUs. | ||
|
Contributor comment on lines +63 to +71: But what's the best practice?
||
|
|
||
| ## Never store secrets in the shared config file | ||
|
|
||
| Do not put user credentials in the default config file (`.dvc/config`), which is | ||
| tracked by Git. Use the `--local`, `--global`, or `--system` options of | ||
| `dvc config` to provide sensitive or user-specific settings: | ||
|
|
||
| ```dvc | ||
| $ dvc config --local remote.password mypassword # just here | ||
| $ dvc config --global core.checksum_jobs 16 # all my projects | ||
| $ dvc config --system core.check_update false # all users | ||
| ``` | ||
|
|
||
| ## Tracking experiments with Git | ||
|
|
||
| If you are training different models on your data files in the same project, | ||
| using Git commits, tags, or branches makes it easy to manage the project. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # Tips and tricks for DVC Projects | ||
|
|
||
| Using the methods listed here, you can manage your DVC projects more | ||
| efficiently. | ||
|
|
||
| ## Switching between datasets | ||
|
|
||
| You can quickly switch between a large dataset and a small subset without | ||
| modifying source code: change the dependencies of the stage, either by manually | ||
| editing the stage in `dvc.yaml` or by using `dvc run` again with `-f`. | ||
|
|
||
| <!-- TODO: needs actual example --> | ||
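|
|
||
| As a minimal sketch of the manual edit, assume a `train` stage and two dataset | ||
| paths, `data/sample` and `data/full` (all hypothetical names, not part of this | ||
| project). Switching datasets could then amount to changing one dependency and | ||
| the matching command argument in `dvc.yaml`: | ||
|
|
||
| ```yaml | ||
| stages: | ||
|   train: | ||
|     # Both the command and the dependency point at the small subset. | ||
|     # Replace data/sample with data/full (hypothetical paths) in both | ||
|     # places to use the whole dataset instead. | ||
|     cmd: python src/train.py data/sample model.pkl | ||
|     deps: | ||
|       - src/train.py | ||
|       - data/sample | ||
|     outs: | ||
|       - model.pkl | ||
| ``` | ||
|
|
||
| After such an edit, `dvc repro` should pick up the changed stage definition and | ||
| run the stage again. | ||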
|
|
||
| ## Tracking small data with Git | ||
|
|
||
| If your output files are small and you want to track them with Git, use the | ||
| `--outs-no-cache` option to define those outputs when creating or modifying a | ||
| stage. DVC will not cache the outputs in this case: | ||
|
|
||
| ```dvc | ||
| $ dvc run -n train -d src/train.py -d data/features \ | ||
| --outs-no-cache model.pkl \ | ||
| python src/train.py data/features model.pkl | ||
| ``` | ||
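|
|
||
| For reference, the stage that such a command records in `dvc.yaml` would look | ||
| roughly like the sketch below. It assumes the output is named `model.pkl`, as in | ||
| the training command; the uncached output is marked with `cache: false`: | ||
|
|
||
| ```yaml | ||
| stages: | ||
|   train: | ||
|     cmd: python src/train.py data/features model.pkl | ||
|     deps: | ||
|       - data/features | ||
|       - src/train.py | ||
|     outs: | ||
|       # cache: false means DVC does not store this file in its cache, | ||
|       # so it can be committed with Git instead. | ||
|       - model.pkl: | ||
|           cache: false | ||
| ``` | ||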
|
|
||
| ## Partial reproducibility | ||
|
|
||
| You can run a model's evaluation process again without preprocessing a raw | ||
| dataset again, or retraining the model. Pass a target stage to `dvc repro` to | ||
| execute only the necessary parts of the pipeline: | ||
|
|
||
| ```dvc | ||
| $ dvc repro evaluate | ||
| ``` | ||
|
|
||
| ## User metadata in DVC metafiles | ||
|
|
||
| DVC provides an optional `meta` field for `dvc.yaml` and `.dvc` metafiles | ||
| (that's very meta!). It can be used to add any user information (as YAML content | ||
| e.g. `"a string"`). |
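|
|
||
| A minimal sketch of what this could look like in `dvc.yaml`, reusing the | ||
| `prepare` stage from the best-practices guide. The keys under `meta` (`owner`, | ||
| `note`) are hypothetical; placing `meta` at the stage level is an assumption | ||
| based on the `dvc.yaml` format: | ||
|
|
||
| ```yaml | ||
| stages: | ||
|   prepare: | ||
|     cmd: python src/prepare.py data/data.xml | ||
|     deps: | ||
|       - data/data.xml | ||
|     outs: | ||
|       - data/prepared | ||
|     meta: | ||
|       # Free-form user information; these keys are illustrative only. | ||
|       owner: data-team | ||
|       note: "a string" | ||
| ``` | ||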
This isn't really a best practice, just an intro to data tracking with DVC. I guess it could stay here... Not sure 🤔