From 0dc1b6c0d08d8fea75cca5384537954ebf4d7b61 Mon Sep 17 00:00:00 2001 From: David Herron Date: Sun, 3 Mar 2019 09:11:32 -0800 Subject: [PATCH 1/3] Update dvc add documentation --- static/docs/commands-reference/add.md | 193 ++++++++++++++++++++------ 1 file changed, 150 insertions(+), 43 deletions(-) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index a013152c51..787d7b13a4 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -2,82 +2,189 @@ Take a data file or a directory under DVC control. +## Synopsis + ```usage - usage: dvc add [-h] [-q] [-v] [-R] [--no-commit] targets [targets ...] + usage: dvc add [-h] [-q | -v] [-R] [--no-commit] targets [targets ...] positional arguments: - targets Input files/directories - - optional arguments: - -h, --help show this help message and exit - -q, --quiet Be quiet. - -v, --verbose Be verbose. - -R, --recursive Recursively add each file under the directory. - --no-commit Don't save changes to cache. + targets Input files/directories. + ``` -Under the hood a few steps are happening: +## Description + +The `dvc add` command is analogous to the `git add` command. By default an +added file is committed to the DVC cache. Using the `--no-commit` option, the +file will not be added to the cache and instead the `dvc commit` command is +used when (or if) the file is to be committed to the DVC cache. + +Under the hood a few actions are taken for each file in the target(s): 1. Move the file content to the DVC cache (default location is `.dvc/cache`). 2. Calculate the file checksum. 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding DVC file (metafile `.dvc`) and store the checksum -to identify the right file in cache. - -Only _metafile_ (basically, pointer to the data in cache) is stored in Git, -DVC manages data file contents. 
- -See [DVC File Format](/doc/user-guide/dvc-file-format) for the detailed -description of the _metafile_ format. - -DVC stores the file's last modification timestamp, inode, and the checksum into -a global state file `.dvc/state` to reduce time recomputing checksums later. - -Note, by default dvc tries a range of link types (reflink, hardlink, symlink, -copy) to try to avoid copying any file contents and make dvc file operations -very quick even for large files. Reflink is the best link type we could have, -but even though it is becoming more and more common in modern filesystems, many -filesystems still don't support it and thus dvc has to resort to a much more -common hardlinks. See `dvc config` for more information. + to identify the cache entry. +5. Add the _target_ filename to `.gitignore` to prevent it from being + committed to the Git repository. +6. Instructions are printed showing `git` commands for the files to be added to + the Git repository. + +The result is data file is added to the DVC cache, and the Git repository stores +the metafile (`.dvc`). The stage file (metafile) lists the added file as an +`out` (output) of the stage, and references the DVC cache entry using the +checksum. See [DVC File Format](/doc/user-guide/dvc-file-format) for the +detailed description of the DVC _metafile_ format. + +By default DVC tries a range of link types (`reflink`, `hardlink`, `symlink`, +or `copy`) to try to avoid copying any file contents and to optimize DVC file +operations even for large files. The `reflink` is the best link type available, +but even though it is frequently supported by modern filesystems, many others +still don't support it. DVC has the other link types for use on filesystems +without `reflink` support. See `dvc config` for more information. + +A `dvc add` target can be an individual file or a directory. There are two ways +to work with directory hierarchies with `dvc add`. + +1. 
With `dvc add --recursive`, the hierarchy is traversed and every file is + added individually as described above. This means every file has its own + `.dvc` file, and a corresponding DVC cache entry is made. If the + `--no-commit` flag is added the DVC cache entry is not made. +2. When not using `--recursive` a DVC stage file is created for the top of + the directory (`dirname.dvc`), and every file in the hierarchy is added to the + DVC cache, but these files do not have individual DVC files. Instead the DVC + file for the directory has a corresponding file in the DVC cache containing + references to the files in the directory hierarchy. + +## Options + +* `-R`, `--recursive` Recursively add each file under the named directory. For + each file a new DVC file is created using the process described earlier. + +* `--no-commit` Do not put files/directories into cache. A stage file is created, + and an entry is added to `.dvc/state`, while nothing is added to the + cache (`.dvc/cache`). The `dvc status` command will note that the file + is `not in cache`. The `dvc commit` command will add the file to + the DVC cache. This is analogous to the `git add` and `git commit` commands. + +* `-h`, `--help` - prints the usage/help message, and exit. + +* `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if + all stages are up to date or if all stages are successfully rerun, otherwise + exit with 1. + +* `-v`, `--verbose` - displays detailed tracing information from executing the + `dvc add` command. -For directories, the command does the same steps for each file recursively. -To retain information about the directory structure, a corresponding cache -file will be created in `.dvc/cache`. ## Examples Take files under DVC control: ```dvc + $ ls raw - Badges.xml PostLinks.xml Votes.xml + dog.111.jpg dog.121.jpg dog.131.jpg dog.141.jpg + + $ dvc add raw/dog.111.jpg raw/dog.121.jpg raw/dog.131.jpg raw/dog.141.jpg + + Adding 'raw/dog.111.jpg' to 'raw/.gitignore'. 
+ Saving 'raw/dog.111.jpg' to cache '.dvc/cache'. + Saving information to 'raw/dog.111.jpg.dvc'. + + To track the changes with git run: + + git add raw/.gitignore raw/dog.111.jpg.dvc + + Adding 'raw/dog.121.jpg' to 'raw/.gitignore'. + Saving 'raw/dog.121.jpg' to cache '.dvc/cache'. + Saving information to 'raw/dog.121.jpg.dvc'. + + To track the changes with git run: + + git add raw/.gitignore raw/dog.121.jpg.dvc - $ dvc add raw/Badges.xml raw/PostLinks.xml raw/Votes.xml + Adding 'raw/dog.131.jpg' to 'raw/.gitignore'. + Saving 'raw/dog.131.jpg' to cache '.dvc/cache'. + Saving information to 'raw/dog.131.jpg.dvc'. + + To track the changes with git run: + + git add raw/.gitignore raw/dog.131.jpg.dvc + + Adding 'raw/dog.141.jpg' to 'raw/.gitignore'. + Saving 'raw/dog.141.jpg' to cache '.dvc/cache'. + Saving information to 'raw/dog.141.jpg.dvc'. + + To track the changes with git run: + + git add raw/.gitignore raw/dog.141.jpg.dvc ``` -Note, DVC files have been created: +As the output says, stage files have been created for each file. Let us explore +the results. + +We see that DVC files were created: ```dvc $ ls raw - - Badges.xml PostLinks.xml Votes.xml - Badges.xml.dvc PostLinks.xml.dvc Votes.xml.dvc + + dog.111.jpg dog.111.jpg.dvc dog.121.jpg dog.121.jpg.dvc + dog.131.jpg dog.131.jpg.dvc dog.141.jpg dog.141.jpg.dvc ``` -Let's check one of them: +Let's check the format used for the DVC files. ``` - $ cat raw/Badges.xml.dvc + $ cat raw/dog.111.jpg.dvc - md5: e16f4a8bb4cd3c30562221b3271b92a6 + md5: aae37d74224b05178153acd94e15956b outs: - cache: true - md5: 573e3e83636983961017902c60175bc0 + md5: d8acabbfd4ee51c95da5d7628c7ef74b metric: false - path: Badges.xml + path: dog.111.jpg +``` + +This is a standard DVC stage file with only an `outs` entry. The checksum +should correspond to an entry in the cache. 
+ +```dvc + $ grep ' md5' raw/*.dvc + + raw/dog.111.jpg.dvc: md5: d8acabbfd4ee51c95da5d7628c7ef74b + raw/dog.121.jpg.dvc: md5: 678cf9d7a63239a887eeb30817379a80 + raw/dog.131.jpg.dvc: md5: e6b8b22c62d141a248c07d80ec48534f + raw/dog.141.jpg.dvc: md5: 5d78f8e9366e44f5a77e79d3bd1bd904 +``` + +This gives us the checksum for each file. + +```dvc + $ ls .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b + .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b + $ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b + .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 499x375, frames 3 + + $ ls .dvc/cache/67/8cf9d7a63239a887eeb30817379a80 + .dvc/cache/67/8cf9d7a63239a887eeb30817379a80 + $ file .dvc/cache/67/8cf9d7a63239a887eeb30817379a80 + .dvc/cache/67/8cf9d7a63239a887eeb30817379a80: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x334, frames 3 + + $ ls .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 + .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 + $ file .dvc/cache/e6/b8b22c62d141a248c07d80ec48534f + .dvc/cache/e6/b8b22c62d141a248c07d80ec48534f: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 468x423, frames 3 + + $ ls .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 + .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 + $ file .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 + .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x374, frames 3 ``` -You can see that the file contains a checksum for the file. It basically serves -as a pointer to the remote storage or local cache. +Then we can individually verify each has a corresponding DVC cache entry. With +the `file` command we verify that each is a JPEG image with the expected +characteristics. 
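The cache paths in the example above follow a visible pattern: the first two hexadecimal characters of the checksum name a subdirectory of `.dvc/cache`, and the remaining characters name the file. A minimal Python sketch of that mapping follows. It is inferred only from the paths shown in this patch; the function name `cache_path` is hypothetical and is not part of DVC's API.

```python
from pathlib import PurePosixPath


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> str:
    """Return the cache location implied by a checksum.

    Hypothetical helper: the two-character split is an assumption taken
    from the example paths in this document, not from DVC's source code.
    """
    if len(md5) < 3:
        raise ValueError("checksum too short to split into directory and file")
    # First two characters form the subdirectory, the rest the file name.
    return str(PurePosixPath(cache_dir) / md5[:2] / md5[2:])


print(cache_path("d8acabbfd4ee51c95da5d7628c7ef74b"))
```

Applied to the checksum recorded in `raw/dog.111.jpg.dvc`, this reproduces the path that the `ls` command above lists.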
From c7a53f7bc6a0d8337d230f0752024d7e9a7c23b1 Mon Sep 17 00:00:00 2001 From: David Herron Date: Sun, 3 Mar 2019 12:09:43 -0800 Subject: [PATCH 2/3] A little more update to dvc add command --- static/docs/commands-reference/add.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 787d7b13a4..9da8a89622 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -57,6 +57,10 @@ to work with directory hierarchies with `dvc add`. file for the directory has a corresponding file in the DVC cache containing references to the files in the directory hierarchy. +The `dvc add` command is useful for manually maintaining updates to files. For +files that are the outputs of running a command, it is better to use the +`dvc run` command to create a DVC stage file listing dependencies and outputs. + ## Options * `-R`, `--recursive` Recursively add each file under the named directory. For From 827e4fc12062214983d8dad75b0ea1040737727a Mon Sep 17 00:00:00 2001 From: David Herron Date: Mon, 4 Mar 2019 20:58:36 -0800 Subject: [PATCH 3/3] Improved discussion and examples in dvc add command --- static/docs/commands-reference/add.md | 202 +++++++++++++++----------- 1 file changed, 121 insertions(+), 81 deletions(-) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 9da8a89622..687700885e 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -5,7 +5,9 @@ Take a data file or a directory under DVC control. ## Synopsis ```usage - usage: dvc add [-h] [-q | -v] [-R] [--no-commit] targets [targets ...] + usage: dvc add [-h] [-q | -v] + [-R] [--no-commit] + targets [targets ...] positional arguments: targets Input files/directories. @@ -14,7 +16,7 @@ Take a data file or a directory under DVC control. ## Description -The `dvc add` command is analogous to the `git add` command. 
By default an +The `dvc add` command is analogous to the `git add` command. By default an added file is committed to the DVC cache. Using the `--no-commit` option, the file will not be added to the cache and instead the `dvc commit` command is used when (or if) the file is to be committed to the DVC cache. @@ -26,56 +28,59 @@ Under the hood a few actions are taken for each file in the target(s): 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding DVC file (metafile `.dvc`) and store the checksum to identify the cache entry. -5. Add the _target_ filename to `.gitignore` to prevent it from being - committed to the Git repository. -6. Instructions are printed showing `git` commands for the files to be added to - the Git repository. +5. Add the _target_ filename to `.gitignore` (if Git is used in this workspace) + to prevent it from being committed to the Git repository. This behavior is + prevented if the workspace is initialized with the `--no-scm` option. +6. Instructions are printed showing `git` commands for adding the files to a + Git repository. If a different SCM system is being used, use the equivalent + command for that system. The result is data file is added to the DVC cache, and the Git repository stores the metafile (`.dvc`). The stage file (metafile) lists the added file as an -`out` (output) of the stage, and references the DVC cache entry using the -checksum. See [DVC File Format](/doc/user-guide/dvc-file-format) for the +`out` (output) of the stage, and references the DVC cache entry using the +checksum. See [DVC File Format](/doc/user-guide/dvc-file-format) for the detailed description of the DVC _metafile_ format. By default DVC tries a range of link types (`reflink`, `hardlink`, `symlink`, -or `copy`) to try to avoid copying any file contents and to optimize DVC file -operations even for large files. 
The `reflink` is the best link type available, -but even though it is frequently supported by modern filesystems, many others -still don't support it. DVC has the other link types for use on filesystems +or `copy`) to try to avoid copying any file contents and to optimize DVC file +operations even for large files. The `reflink` is the best link type available, +but even though it is frequently supported by modern filesystems, many others +still don't support it. DVC has the other link types for use on filesystems without `reflink` support. See `dvc config` for more information. A `dvc add` target can be an individual file or a directory. There are two ways to work with directory hierarchies with `dvc add`. -1. With `dvc add --recursive`, the hierarchy is traversed and every file is - added individually as described above. This means every file has its own - `.dvc` file, and a corresponding DVC cache entry is made. If the +1. With `dvc add --recursive`, the hierarchy is traversed and every file is + added individually as described above. This means every file has its own + `.dvc` file, and a corresponding DVC cache entry is made. If the `--no-commit` flag is added the DVC cache entry is not made. -2. When not using `--recursive` a DVC stage file is created for the top of - the directory (`dirname.dvc`), and every file in the hierarchy is added to the - DVC cache, but these files do not have individual DVC files. Instead the DVC - file for the directory has a corresponding file in the DVC cache containing +2. When not using `--recursive` a DVC stage file is created for the top of + the directory (`dirname.dvc`), and every file in the hierarchy is added to the + DVC cache, but these files do not have individual DVC files. Instead the DVC + file for the directory has a corresponding file in the DVC cache containing references to the files in the directory hierarchy. -The `dvc add` command is useful for manually maintaining updates to files. 
For -files that are the outputs of running a command, it is better to use the -`dvc run` command to create a DVC stage file listing dependencies and outputs. +The `dvc add` command is useful for manually maintaining updates to datasets +and files. For files that are the outputs of running a command, it is better +to use the `dvc run` command to create a DVC stage file listing dependencies +and outputs. ## Options -* `-R`, `--recursive` Recursively add each file under the named directory. For +* `-R`, `--recursive` Recursively add each file under the named directory. For each file a new DVC file is created using the process described earlier. * `--no-commit` Do not put files/directories into cache. A stage file is created, and an entry is added to `.dvc/state`, while nothing is added to the cache (`.dvc/cache`). The `dvc status` command will note that the file - is `not in cache`. The `dvc commit` command will add the file to + is `not in cache`. The `dvc commit` command will add the file to the DVC cache. This is analogous to the `git add` and `git commit` commands. * `-h`, `--help` - prints the usage/help message, and exit. * `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if - all stages are up to date or if all stages are successfully rerun, otherwise + all stages are up to date or if all stages are successfully rerun, otherwise exit with 1. * `-v`, `--verbose` - displays detailed tracing information from executing the @@ -86,13 +91,13 @@ files that are the outputs of running a command, it is better to use the Take files under DVC control: -```dvc +``` $ ls raw - dog.111.jpg dog.121.jpg dog.131.jpg dog.141.jpg + dog.111.jpg - $ dvc add raw/dog.111.jpg raw/dog.121.jpg raw/dog.131.jpg raw/dog.141.jpg + $ dvc add raw/dog.111.jpg Adding 'raw/dog.111.jpg' to 'raw/.gitignore'. Saving 'raw/dog.111.jpg' to cache '.dvc/cache'. 
@@ -102,29 +107,6 @@ Take files under DVC control: git add raw/.gitignore raw/dog.111.jpg.dvc - Adding 'raw/dog.121.jpg' to 'raw/.gitignore'. - Saving 'raw/dog.121.jpg' to cache '.dvc/cache'. - Saving information to 'raw/dog.121.jpg.dvc'. - - To track the changes with git run: - - git add raw/.gitignore raw/dog.121.jpg.dvc - - Adding 'raw/dog.131.jpg' to 'raw/.gitignore'. - Saving 'raw/dog.131.jpg' to cache '.dvc/cache'. - Saving information to 'raw/dog.131.jpg.dvc'. - - To track the changes with git run: - - git add raw/.gitignore raw/dog.131.jpg.dvc - - Adding 'raw/dog.141.jpg' to 'raw/.gitignore'. - Saving 'raw/dog.141.jpg' to cache '.dvc/cache'. - Saving information to 'raw/dog.141.jpg.dvc'. - - To track the changes with git run: - - git add raw/.gitignore raw/dog.141.jpg.dvc ``` As the output says, stage files have been created for each file. Let us explore @@ -132,11 +114,10 @@ the results. We see that DVC files were created: -```dvc +``` $ ls raw - dog.111.jpg dog.111.jpg.dvc dog.121.jpg dog.121.jpg.dvc - dog.131.jpg dog.131.jpg.dvc dog.141.jpg dog.141.jpg.dvc + dog.111.jpg dog.111.jpg.dvc ``` Let's check the format used for the DVC files. @@ -152,43 +133,102 @@ Let's check the format used for the DVC files. path: dog.111.jpg ``` -This is a standard DVC stage file with only an `outs` entry. The checksum +This is a standard DVC stage file with only an `outs` entry. The checksum should correspond to an entry in the cache. -```dvc - $ grep ' md5' raw/*.dvc - - raw/dog.111.jpg.dvc: md5: d8acabbfd4ee51c95da5d7628c7ef74b - raw/dog.121.jpg.dvc: md5: 678cf9d7a63239a887eeb30817379a80 - raw/dog.131.jpg.dvc: md5: e6b8b22c62d141a248c07d80ec48534f - raw/dog.141.jpg.dvc: md5: 5d78f8e9366e44f5a77e79d3bd1bd904 ``` - -This gives us the checksum for each file. 
- -```dvc $ ls .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b + $ file .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b .dvc/cache/d8/acabbfd4ee51c95da5d7628c7ef74b: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 499x375, frames 3 +``` + +We can then verify that the file has a corresponding DVC cache entry. With +the `file` command we confirm that it is a JPEG image with the expected +characteristics. + +What if you have not one dog picture, but hundreds of pictures of dogs and cats? +Your goal might be to build an algorithm to identify dogs and cats in pictures, +and this is your training data set. + +``` + $ du pics + 11092 pics/train/cats + 13044 pics/train/dogs + 24140 pics/train + 9244 pics/validation/cats + 10496 pics/validation/dogs + 19744 pics/validation + 43888 pics + + $ dvc add pics + Computing md5 for a large directory pics/train/cats. This is only done once. + [##############################] 100% pics/train/cats - $ ls .dvc/cache/67/8cf9d7a63239a887eeb30817379a80 - .dvc/cache/67/8cf9d7a63239a887eeb30817379a80 - $ file .dvc/cache/67/8cf9d7a63239a887eeb30817379a80 - .dvc/cache/67/8cf9d7a63239a887eeb30817379a80: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x334, frames 3 + Computing md5 for a large directory pics/train/dogs. This is only done once. + [##############################] 100% pics/train/dogs - $ ls .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 - .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 - $ file .dvc/cache/e6/b8b22c62d141a248c07d80ec48534f - .dvc/cache/e6/b8b22c62d141a248c07d80ec48534f: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 468x423, frames 3 + Computing md5 for a large directory pics/validation/cats. This is only done once. 
+ [##############################] 100% pics/validation/cats - $ ls .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 - .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 - $ file .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904 - .dvc/cache/5d/78f8e9366e44f5a77e79d3bd1bd904: JPEG image data, JFIF standard 1.01, aspect ratio, density 1x1, segment length 16, baseline, precision 8, 500x374, frames 3 + Computing md5 for a large directory pics/validation/dogs. This is only done once. + [##############################] 100% pics/validation/dogs + Saving 'pics' to cache '.dvc/cache'. + + Linking directory 'pics'. + [##############################] 100% pics + + Saving information to 'pics.dvc'. + + To track the changes with git run: + + git add pics.dvc ``` -Then we can individually verify each has a corresponding DVC cache entry. With -the `file` command we verify that each is a JPEG image with the expected -characteristics. +There are no DVC files generated within this directory structure, but the +images are all added to the DVC cache. DVC prints a message to that effect, +saying that `md5` values are computed for each directory. A DVC file is +generated for the top-level directory, and it contains this: + +```yaml + md5: df06d8d51e6483ed5a74d3979f8fe42e + outs: + - cache: true + md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir + metric: false + path: pics + wdir: . +``` + +If you instead use the `--recursive` option, the output looks like this: + +``` + $ dvc add --recursive pics + Saving 'pics/train/cats/cat.150.jpg' to cache '.dvc/cache'. + Saving 'pics/train/cats/cat.130.jpg' to cache '.dvc/cache'. + Saving 'pics/train/cats/cat.111.jpg' to cache '.dvc/cache'. + Saving 'pics/train/cats/cat.438.jpg' to cache '.dvc/cache'. + ... +``` + +In this case a DVC file is generated for each individual file, and no +top-level DVC file is generated, which is less convenient. 
+ +With `dvc add pics`, a single DVC file, `pics.dvc`, is generated, which lets +you treat the entire directory structure as one unit. It lets you pass the +whole directory tree as input to a `dvc run` stage like so: + +``` + $ dvc run -f train.dvc -d train.py -d data -M metrics.json -o model.h5 \ + -o bottleneck_features_train.npy \ + -o bottleneck_features_validation.npy \ + python train.py +``` + +To see the whole example, go to +[Example: Versioning](/doc/get-started/example-versioning). + +Since no top-level DVC file is generated with the `--recursive` option, you +cannot use the directory structure as a whole. \ No newline at end of file
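The description in this patch lists "Calculate the file checksum" as one of the steps `dvc add` performs, and the generated `.dvc` file records an `md5` value under `outs`. The sketch below shows one way such a checksum could be computed in Python. It assumes the recorded value is a plain MD5 digest of the file's bytes, which this document implies but does not state; the function name `file_md5` is hypothetical and is not DVC's implementation.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file.

    Hypothetical helper: the file is read in fixed-size chunks so that
    large data files do not have to fit in memory at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Under the stated assumption, calling `file_md5("raw/dog.111.jpg")` would be expected to match the `md5` listed under `outs` in `raw/dog.111.jpg.dvc`. Comparing the two is a quick manual check that a workspace file still corresponds to its cache entry.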