
Update dvc add documentation #195

Merged
shcheklein merged 3 commits into master from add on Mar 5, 2019
Conversation

@robogeek
Contributor

@robogeek robogeek commented Mar 3, 2019

No description provided.

Review comment threads on static/docs/commands-reference/add.md (four marked Outdated).
Contributor

@shcheklein shcheklein left a comment


Looks great overall. Put a few comments inline to address. In addition to that:

  1. There are lines with trailing whitespace. Again, you can enable a mode in VS Code (or Vim, or any other editor) to highlight it. It's bad because in Markdown trailing spaces force a hard line break, producing lines like "end of word / new word" (there is at least one I noticed in this PR).
  2. The paragraph that describes why dvc add is useful needs some clarification, especially the word "manually". It is reasonable to take one of the outputs under control and update it with a script run via make. Does that count as manual? I'm not sure. And the second very common scenario is to take input data (which is usually fairly static) and then build a pipeline on top of it with dvc run. You actually have to use dvc add for input datasets.
  3. Examples, examples, examples. The biggest problem. The output is far too detailed, and 4 DVC files are too many for the task. Let's think this through. Let's also mention that
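For reference, a minimal sketch of the input-dataset workflow described in point 2. All file names here (data/, train.py, model.h5) are illustrative, not taken from this PR:

```shell
# Hypothetical layout: a static input dataset plus a training script.
# dvc add takes the dataset under DVC control and writes data.dvc;
# dvc run then builds a pipeline stage that depends on it.
dvc add data                  # creates data.dvc, moves data/ into the DVC cache
dvc run -f train.dvc \
        -d train.py -d data \
        -o model.h5 \
        python train.py       # stage file records checksums of train.py and data
```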

@robogeek
Contributor Author

robogeek commented Mar 5, 2019

@shcheklein said:

And the second very common scenario is to take input data (that is usually pretty static) to create a pipeline with dvc run on top after that. You actually have to use dvc add for input datasets.

I'm not sure I follow, based on the replication I just did: dvc init --no-scm, then copying train.py and the data directory from the Versioning example, then running the dvc run command from the bottom of that example.

Specifically:

dvc init --no-scm
cp ../example-versioning/requirements.txt .
cp ../example-versioning/train.py .
(cd ../example-versioning/; tar cf - data) | tar xvf -
dvc run -f train.py -d train.py -d data \
    -M metrics.json \
    -o model.h5 -o bottleneck_features_train.npy \
    -o bottleneck_features_validation.npy \
    python train.py
cat train.py

I got this stage file:

cmd: python train.py
deps:
- md5: 78d98f3865c2fcfe1dbe95b738960d0a
  path: train.py
- md5: b8f4d5a78e55e88906d5f4aeaf43802e.dir
  path: data
md5: 71acee76d9f4458059ae5b7f2435cb32
outs:
- cache: true
  md5: 6a92af2a09ec797dcb0dab2cfa1ac778
  metric: false
  path: model.h5
- cache: true
  md5: da9e20b12aa5b2dc0abb02e1a1b4e4cf
  metric: false
  path: bottleneck_features_train.npy
- cache: true
  md5: e548cc847339c990a7dbe0759d87c7c4
  metric: false
  path: bottleneck_features_validation.npy
- cache: false
  md5: 0b14406a44c15521efc0f4d96c80befd
  metric: true
  path: metrics.json
wdir: .

The data directory is listed as a dep just like it would have been if the data directory had been added separately. Unless I'm missing something.

I reran the same steps, but this time used dvc add first, and came up with an identical DVC file.

Specifically:

dvc init --no-scm
cp ../example-versioning/train.py .
(cd ../example-versioning/; tar cf - data) | tar xf -
dvc add data
dvc run -f train.dvc -d train.py -d data \
    -M metrics.json \
    -o model.h5 -o bottleneck_features_train.npy \
    -o bottleneck_features_validation.npy \
    python train.py
cat train.dvc

Since train.dvc came up the same each time, is there really a requirement to use dvc add to add a dataset before doing dvc run? I don't know what the difference is.

@shcheklein
Contributor

@robogeek yep, the difference is that dvc run does not take its -d dependencies under control (that's why, for example, there is no -D, btw). It saves checksums for dependencies so that dvc repro can analyze them and re-execute the stage if something changed, and it saves names (paths) to connect DVC files into a graph.
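To make the distinction concrete, here is a hedged sketch using the data directory from the discussion above. The point is that dvc add produces a standalone DVC file whose outs entry caches the data itself, whereas a dvc run stage only records the data under deps:

```shell
# With dvc add, the dataset gets its own DVC file and is stored in the cache:
dvc add data      # writes data.dvc with an 'outs' entry for data/ (cached)
cat data.dvc      # outs: - path: data, md5: ...

# With dvc run alone, 'data' appears only under 'deps' in the stage file:
#   deps: - path: data, md5: ...
# The checksum is recorded for dvc repro, but data/ itself is not cached.
```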

@robogeek
Contributor Author

robogeek commented Mar 5, 2019

I have updated the document to match the comments above.
