
Update dvc add documentation #195

Merged
shcheklein merged 3 commits into master from add on Mar 5, 2019
Conversation

@robogeek
Contributor

@robogeek robogeek commented Mar 3, 2019

No description provided.

Review comment threads on static/docs/commands-reference/add.md (four marked Outdated).
Contributor

@shcheklein shcheklein left a comment


Looks great overall. Put a few comments inline to address. In addition to that:

  1. There are lines with trailing whitespace. Again, you can enable a mode in VS Code (or Vim, or any other editor) to highlight it. It's bad because in Markdown trailing spaces force a hard line break, producing lines like "end of word / new word" (there is at least one I noticed in this PR).
  2. The paragraph that describes why dvc add is useful needs some clarification, especially the word "manually". It is reasonable to take one of the outputs under control and update it with a script run via make. Does that count as manual? I'm not sure. And the second very common scenario is to take input data (which is usually fairly static) and then build a pipeline on top of it with dvc run. You actually have to use dvc add for input datasets.
  3. Examples, examples, examples. The biggest problem. The output is far too detailed, and 4 DVC files are too many for the task. Let's think this through. Let's also mention that
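For reference, a minimal sketch of the input-dataset workflow described in point 2. All file names here (data/, train.py, model.h5) are illustrative, not taken from this PR:

```shell
# Hypothetical layout: a static input dataset plus a training script.
# dvc add takes the dataset under DVC control and writes data.dvc;
# dvc run then builds a pipeline stage that depends on it.
dvc add data                  # creates data.dvc, moves data/ into the DVC cache
dvc run -f train.dvc \
        -d train.py -d data \
        -o model.h5 \
        python train.py       # stage file records checksums of train.py and data
```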

@robogeek
Contributor Author

robogeek commented Mar 5, 2019

@shcheklein said:

And the second very common scenario is to take input data (that is usually pretty static) to create a pipeline with dvc run on top after that. You actually have to use dvc add for input datasets.

I'm not sure I follow, based on the replication I just did: dvc init --no-scm, then copying train.py and the data directory from the Versioning example, then running the dvc run command from the bottom of that example.

Specifically:

dvc init --no-scm
cp ../example-versioning/requirements.txt .
cp ../example-versioning/train.py .
(cd ../example-versioning/; tar cf - data) | tar xvf -
dvc run -f train.py -d train.py -d data \
    -M metrics.json \
    -o model.h5 -o bottleneck_features_train.npy \
    -o bottleneck_features_validation.npy \
    python train.py
cat train.py

I got this stage file:

cmd: python train.py
deps:
- md5: 78d98f3865c2fcfe1dbe95b738960d0a
  path: train.py
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  path: data
md5: 71acee76d9f4458059ae5b7f2435cb32
outs:
- cache: true
  md5: 6a92af2a09ec797dcb0dab2cfa1ac778
  metric: false
  path: model.h5
- cache: true
  md5: da9e20b12aa5b2dc0abb02e1a1b4e4cf
  metric: false
  path: bottleneck_features_train.npy
- cache: true
  md5: e548cc847339c990a7dbe0759d87c7c4
  metric: false
  path: bottleneck_features_validation.npy
- cache: false
  md5: 0b14406a44c15521efc0f4d96c80befd
  metric: true
  path: metrics.json
wdir: .

The data directory is listed as a dep just like it would have been if the data directory had been added separately. Unless I'm missing something.

I reran the same steps, but this time used dvc add first, and came up with an identical DVC file.

Specifically:

dvc init --no-scm
cp ../example-versioning/train.py .
(cd ../example-versioning/; tar cf - data) | tar xf -
dvc add data
dvc run -f train.dvc -d train.py -d data \
    -M metrics.json \
    -o model.h5 -o bottleneck_features_train.npy \
    -o bottleneck_features_validation.npy \
    python train.py
cat train.dvc

Since train.dvc came up the same each time, is there really a requirement to use dvc add to add a dataset before doing dvc run? I don't know what the difference is.

@shcheklein
Contributor

@robogeek yep, the difference is that dvc run does not take its -d dependencies under control (that's why, for example, there is no -D, btw). It saves checksums for dependencies so that dvc repro can analyze them and re-execute the stage if something changed, and it saves names (paths) to connect DVC files into a graph.
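To make the distinction concrete, here is a hedged sketch using the data directory from the discussion above. The point is that dvc add produces a standalone DVC file whose outs entry caches the data itself, whereas a dvc run stage only records the data under deps:

```shell
# With dvc add, the dataset gets its own DVC file and is stored in the cache:
dvc add data      # writes data.dvc with an 'outs' entry for data/ (cached)
cat data.dvc      # outs: - path: data, md5: ...

# With dvc run alone, 'data' appears only under 'deps' in the stage file:
#   deps: - path: data, md5: ...
# The checksum is recorded for dvc repro, but data/ itself is not cached.
```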

@robogeek
Contributor Author

robogeek commented Mar 5, 2019

I have updated the document to match the comments above.
