From ff3e9d696fb18deb630ff9ff0862883489c5ca7a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 01:55:59 -0500 Subject: [PATCH 01/26] term: change `ENV` for env(ironment) in contributing user guide --- static/docs/user-guide/contributing.md | 21 +++++++++++---------- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md index 5837a45247..6e20eef32e 100644 --- a/static/docs/user-guide/contributing.md +++ b/static/docs/user-guide/contributing.md @@ -141,23 +141,24 @@ $ pip install -e ".[ssh]" $ pip install -e ".[all]" ``` -You will need to update your `ENV` throughout subsequent steps, so we created a -template for you: +You will need to update your environment throughout subsequent steps, so we +created a template for you: ```dvc $ cp tests/remotes_env.sample tests/remotes_env ``` Then uncomment lines you need and fill in needed values, the details are -explained in remote specific subsections. To activate these env vars use: +explained in remote specific subsections. To activate these environment +variables, use: ```dvc $ source tests/remotes_env ``` -If some member of your team had already went through all of this you may just -ask for their `remotes_env` file and Google Cloud credentials and you can skip -any manipulations with `ENV` below. +If another member of your team has already gone through this guide, you could +just ask for their `remotes_env` file and Google Cloud credentials, and skip any +env manipulations below.
@@ -167,8 +168,8 @@ Install [aws cli](https://docs.aws.amazon.com/en_us/cli/latest/userguide/cli-chap-install.html) tools. -Set up an account, get credentials, which will have access to S3. Then, set -`ENV` vars like this: +Set up an account, get credentials, which will have access to S3. Then, set env +vars like this: ```dvc $ export AWS_ACCESS_KEY_ID="...YOUR-ACCESS-KEY-ID..." @@ -188,7 +189,7 @@ authenticated with your google account. You then need to create a bucket, a service account and get its credentials. You can do this via web UI or terminal. Then you need to put your keys to -`scripts/ci/gcp-creds.json` and add these to your `ENV`: +`scripts/ci/gcp-creds.json` and add these to your env vars: ```dvc $ export GOOGLE_APPLICATION_CREDENTIALS=".gcp-creds.json" @@ -231,7 +232,7 @@ $ mkdir azurite $ azurite -s -l azurite -d azurite/debug.log ``` -Add this to your `ENV`: +Add this to your env: ```dvc $ export AZURE_STORAGE_CONTAINER_NAME="dvc-test" From d86008aa8025bce3919e283b32430af45e7b4fd0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 02:18:19 -0500 Subject: [PATCH 02/26] remove: update help output to match https://github.com/iterative/dvc/pull/2457/commits/80076d8 --- static/docs/commands-reference/remove.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/static/docs/commands-reference/remove.md b/static/docs/commands-reference/remove.md index 3eb045c710..ca389e20f2 100644 --- a/static/docs/commands-reference/remove.md +++ b/static/docs/commands-reference/remove.md @@ -8,8 +8,7 @@ Properly remove data files or directories tracked by DVC. usage: dvc remove [-h] [-q | -v] [-o | -p] [-f] targets [targets ...] positional arguments: - targets DVC-files to remove. Optional. (Finds all - DVC-files in the workspace by default.) + targets DVC-files to remove. 
``` ## Description From c38e6ce406f2af68d8eb643706e8ad4f73f980ca Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 02:26:00 -0500 Subject: [PATCH 03/26] get/import: add link and clarify HEAD `--rev` option default per https://github.com/iterative/dvc.org/pull/591#pullrequestreview-281835310 --- static/docs/commands-reference/get.md | 3 ++- static/docs/commands-reference/import.md | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/static/docs/commands-reference/get.md b/static/docs/commands-reference/get.md index d3d652ed7b..c85ef7d7d8 100644 --- a/static/docs/commands-reference/get.md +++ b/static/docs/commands-reference/get.md @@ -42,7 +42,8 @@ created in the current working directory, with its original file name. isn't used) is the current working directory (`.`) and original file name. - `--rev` - specific Git revision of the DVC repository to import the data from. - `HEAD` by default. + [`HEAD`](https://git-scm.com/book/en/v2/Git-Internals-Git-References#ref_the_ref) + is used by default when this option is not specified. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index 86e99838a2..0c2ec76709 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -62,7 +62,8 @@ downloaded data artifact from the external DVC repo. isn't used) is the current working directory (`.`) and original file name. - `--rev` - specific Git revision of the DVC repository to import the data from. - `HEAD` by default. + [`HEAD`](https://git-scm.com/book/en/v2/Git-Internals-Git-References#ref_the_ref) + is used by default when this option is not specified. - `-h`, `--help` - prints the usage/help message, and exit. 
From 07fb5b072f60acf3fd625c92905f442418920d86 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 02:31:44 -0500 Subject: [PATCH 04/26] changelog: remove extra space in changelog/0.35 per https://github.com/iterative/dvc.org/pull/591#pullrequestreview-281835560 --- static/docs/changelog/0.35.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/static/docs/changelog/0.35.md b/static/docs/changelog/0.35.md index c3b1910151..9a852d0171 100644 --- a/static/docs/changelog/0.35.md +++ b/static/docs/changelog/0.35.md @@ -14,7 +14,7 @@ improvements) we have done in the last few months: - 📖 The [Get Started](/doc/get-started/agenda) section has been simplified (e.g. to use tags instead of branches) and extended. We have also prepared a - [Github DVC project ](https://github.com/iterative/example-get-started) that + [Github DVC project](https://github.com/iterative/example-get-started) that reflects the sequence of chapters in the “get started” guide. You can now download the whole project and reproduce all the models. From 5bb7edeba1c803eb9c6bda29f4f45d15028f01fe Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 03:12:42 -0500 Subject: [PATCH 05/26] get-started: reformat add-files --- static/docs/get-started/add-files.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md index 7fba170847..ba3e2ec22b 100644 --- a/static/docs/get-started/add-files.md +++ b/static/docs/get-started/add-files.md @@ -16,8 +16,8 @@ $ wget https://data.dvc.org/get-started/data.xml -O data/data.xml If you experienced problems using `wget` or you're on Windows and you don't want to install it, you'll need to use a browser to download `data.xml` and save it into `data` subdirectory. To download, right-click -[this link](https://data.dvc.org/get-started/data.xml) and click `Save link as` (Chrome) or -`Save object as` (Firefox).
+[this link](https://data.dvc.org/get-started/data.xml) and click `Save link As` +(Chrome) or `Save Object As` (Firefox).
From 8ec0d542a7a2f1f2b444a5416dc6562e86a6b6e8 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 21:03:12 -0500 Subject: [PATCH 06/26] docs: review usage of "DVC" branding of terms (1) for #448 --- src/Documentation/glossary.js | 3 ++- static/docs/commands-reference/commit.md | 20 +++++++++---------- static/docs/commands-reference/move.md | 8 ++++---- .../docs/commands-reference/pipeline/show.md | 2 +- static/docs/commands-reference/remote/add.md | 6 +++--- .../docs/commands-reference/remote/index.md | 2 +- static/docs/commands-reference/root.md | 10 +++++----- static/docs/commands-reference/status.md | 4 ++-- static/docs/get-started/example-versioning.md | 8 ++++---- static/docs/user-guide/analytics.md | 2 +- 10 files changed, 33 insertions(+), 32 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index d6077d3037..d9d12fee62 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -11,7 +11,8 @@ export default { Directory containing all your project files. For example raw datasets, source code, ML models, etc. A workspace becomes a **DVC project** when [\`dvc init\`](/doc/commands-reference/init) is run, and -[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it. +[DVC-files](/doc/user-guide/dvc-file-format) (or stage files) are created in +it. ` }, { diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index ded23eae12..aa943a17fa 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -131,8 +131,8 @@ $ dvc pull --all-branches --all-tags Sometimes we want to iterate through multiple changes to configuration, code, or data, trying multiple options to improve the output of a stage. 
To avoid filling -the DVC cache with undesired intermediate results, we can run a -single stage with `dvc run --no-commit`, or reproduce an entire pipeline using +the cache with undesired intermediate results, we can run a single +stage with `dvc run --no-commit`, or reproduce an entire pipeline using `dvc repro --no-commit`. This prevents data from being pushed to cache. When development of the stage is finished, `dvc commit` can be used to store data files in the DVC cache. @@ -149,7 +149,7 @@ bag_of_words = CountVectorizer(stop_words='english', This option not only changes the trained model, it also introduces a change which would cause the `featurize.dvc`, `train.dvc` and `evaluate.dvc` stages to execute if we ran `dvc repro`. But if we want to try several values for this -option and save only the best result to the DVC cache, we can execute as so: +option and save only the best result to the cache, we can execute as so: ```dvc $ dvc repro --no-commit evaluate.dvc @@ -157,7 +157,7 @@ $ dvc repro --no-commit evaluate.dvc We can run this command as many times as we like, editing `featurize.py` any way we like, and so long as we use `--no-commit`, the data does not get saved to the -DVC cache. But it is instructive to verify that's the case: +cache directory. But it is instructive to verify that's the case: First verification: @@ -194,10 +194,10 @@ wdir: . ``` To verify this instance of `model.pkl` is not in the cache, we must know the -names of the cache files. In the DVC cache the first two characters of the -checksum are used as a directory name, and the file name is the remaining -characters. Therefore, if the file had been committed to the cache it would -appear in the directory `.dvc/cache/70`. But: +path to the cached file. In the cache directory, the first two characters of the +checksum are used as a subdirectory name, and the remaining characters are the +file name. 
Therefore, if the file had been committed to the cache it would +appear in the directory `.dvc/cache/70`. Let's check: ```dvc $ ls .dvc/cache/70 @@ -215,8 +215,8 @@ $ ls .dvc/cache/70 599f166c2098d7ffca91a369a78b0d ``` -And we've verified that `dvc commit` has saved the changes into the cache, and -that the new instance of `model.pkl` is in the cache. +We've verified that `dvc commit` has saved the changes into the cache, and that +the new instance of `model.pkl` is there. ## Example: Running commands without DVC diff --git a/static/docs/commands-reference/move.md b/static/docs/commands-reference/move.md index dc9775c612..5196c05fbe 100644 --- a/static/docs/commands-reference/move.md +++ b/static/docs/commands-reference/move.md @@ -18,10 +18,10 @@ positional arguments: ## Description `dvc move` is useful when a `src` file or directory has previously been added to -DVC with `dvc add`, creating a [DVC-file](/doc/user-guide/dvc-file-format) (with -`src` as a dependency). `dvc move` behaves like `mv src dst`, moving `src` to -the given `dst` path, but it also renames and updates the corresponding DVC-file -appropriately. +the project with `dvc add`, creating a +[DVC-file](/doc/user-guide/dvc-file-format) (with `src` as a dependency). +`dvc move` behaves like `mv src dst`, moving `src` to the given `dst` path, but +it also renames and updates the corresponding DVC-file appropriately. 
> Note that `src` may be a copy or a > [link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) diff --git a/static/docs/commands-reference/pipeline/show.md b/static/docs/commands-reference/pipeline/show.md index 71ec7396ba..6775de8178 100644 --- a/static/docs/commands-reference/pipeline/show.md +++ b/static/docs/commands-reference/pipeline/show.md @@ -121,7 +121,7 @@ $ dvc pipeline show eval.txt.dvc --ascii List dependencies recursively if graph have tree structure: ```dvc -dvc pipeline show e.file.dvc --tree +$ dvc pipeline show e.file.dvc --tree e.file.dvc ├── c.file.dvc │ └── b.file.dvc diff --git a/static/docs/commands-reference/remote/add.md b/static/docs/commands-reference/remote/add.md index fb0f467e27..9e90aeebcd 100644 --- a/static/docs/commands-reference/remote/add.md +++ b/static/docs/commands-reference/remote/add.md @@ -37,7 +37,7 @@ though and will rely on default access settings. > look like this: `pip install "dvc[s3]"`. This installs `boto3` library along > with DVC to support AWS S3 storage. -This command creates a section in the DVC +This command creates a section in the DVC project's [config file](/doc/commands-reference/config) and optionally assigns a default remote in the core section if the `--default` option is used: @@ -195,7 +195,7 @@ $ dvc remote modify myremote connection_string my-connection-string --local ``` > The connection string contains access to data and is inserted into the -> `.dvc/config file.` Therefore, it is safer to add the connection string with +> `.dvc/config` file. Therefore, it is safer to add the connection string with > the `--local` option, enforcing it to be written to a Git-ignored config file. The Azure Blob Storage remote can also be configured entirely via environment @@ -340,7 +340,7 @@ Setting 'myremote' as a default remote.
$ dvc remote modify myremote region us-east-2 ``` -DVC config file (`.dvc/config`) now looks like this: +The project's config file (`.dvc/config`) now looks like this: ```ini ['remote "myremote"'] diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md index 9c485649b0..07cd187c22 100644 --- a/static/docs/commands-reference/remote/index.md +++ b/static/docs/commands-reference/remote/index.md @@ -85,7 +85,7 @@ $ dvc remote list myremote /path/to/remote ``` -DVC config file would look like: +The project's config file would look like: ```ini ['remote "myremote"'] diff --git a/static/docs/commands-reference/root.md b/static/docs/commands-reference/root.md index c8af612021..22ec39b3b5 100644 --- a/static/docs/commands-reference/root.md +++ b/static/docs/commands-reference/root.md @@ -1,6 +1,6 @@ # root -Returns relative path to project's directory. +Returns relative path to the DVC project. ## Synopsis @@ -10,10 +10,10 @@ usage: dvc root [-h] [-q | -v] ## Description -While in project's sub-directory, sometimes developers may want to refer some -file belonging to another directory. This command returns relative path to the -DVC project's root directory from the current working directory. So, this -command can be used to build a path to a dependency file, command, or output. +While in sub-directories of the project, sometimes developers may want to refer +some file belonging to another directory. This command returns relative path to +the project root from the current working directory. So this command can be used +to build a path to a dependency file, command, or output. 
## Options diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index de4601af82..51a30a8a7d 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -2,8 +2,8 @@ Show changes in the project [pipelines](/doc/commands-reference/pipeline), as well as mismatches either -between the local cache and local files, or between the local cache and remote -cache. +between the local cache and local files, or between the cache and +remote cache. ## Synopsis diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 1c10d4ab5a..496070e122 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -334,10 +334,10 @@ Features are written into files, and intention probably was that the very convenient to remember to comment/uncomment it every time dataset is changed. -Here where DVC pipelines feature comes very handy and was designed for. We -touched it briefly when we described `dvc run` and `dvc repro` at the very end. -The next step here would be splitting the script into two parts, and utilizing -DVC [pipelines](/doc/commands-reference/pipeline). See +Here's where the [pipelines](/doc/commands-reference/pipeline) feature of DVC +comes very handy and was designed for. We touched it briefly when we described +`dvc run` and `dvc repro` at the very end. The next step here would be splitting +the script into two parts, and utilizing pipelines. See [this example](/doc/get-started/example-pipeline) to get a hands-on experience with pipelines and try to apply it here. Don't hesitate to join our [community](/chat) to ask any questions! 
diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md index 7c91447e57..a15888f431 100644 --- a/static/docs/user-guide/analytics.md +++ b/static/docs/user-guide/analytics.md @@ -49,7 +49,7 @@ DVC's analytics are sent throughout DVC's proxy to Google Analytics over HTTPS. ## Opting out -DVC analytics helps the entire community and leaving it on is appreciated. +DVC analytics help the entire community, so leaving it on is appreciated. However, if you want to opt out of DVC's analytics, you can disable it via `dvc config` command: From 04133876b7b0cf86f57235d030ae6a97c851a3dc Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 21:22:43 -0500 Subject: [PATCH 07/26] term: "remote cache" -> "remote storage" for #448 --- static/docs/commands-reference/checkout.md | 2 +- static/docs/commands-reference/pull.md | 22 ++++---- static/docs/commands-reference/push.md | 65 +++++++++++----------- static/docs/commands-reference/status.md | 38 ++++++------- 4 files changed, 61 insertions(+), 66 deletions(-) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 66480796ef..0b149f2124 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -74,7 +74,7 @@ checked out without error will be restored. There are two methods to restore a file missing from the cache, depending on the situation. In some cases a pipeline must be reproduced (using `dvc repro`) to regenerate its outputs. (See also `dvc pipeline`.) In other cases the cache can -be pulled from a remote cache using `dvc pull`. +be pulled from remote storage using `dvc pull`. 
## Options diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 8dfd7bdc27..365bbac036 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -59,12 +59,10 @@ reflinks or hardlinks to put it in the workspace without copying. See ## Options -- `-r REMOTE`, `--remote REMOTE` specifies which remote cache (see - `dvc remote list`) to pull from. The value for `REMOTE` is a cache name - defined using the `dvc remote` command. If no `REMOTE` is given, or if no - remote's are defined in the project, an error message is printed. If the - option is not specified, then the default remote, configured with the - `core.config` config option, is used. +- `-r REMOTE`, `--remote REMOTE` specifies which remote to pull from (see + `dvc remote list`). The value for `REMOTE` is a name defined using + `dvc remote`. If the option is not specified, then the default remote + (configured with the `core.remote` config option) is used. - `-a`, `--all-branches` - determines the files to download by examining DVC-files in all branches of the project repository (if using Git). It's @@ -89,10 +87,10 @@ reflinks or hardlinks to put it in the workspace without copying. See in effect performs those 2 functions in a single command. - `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously - while downloading files from the remote cache. The effect is to control the - number of files downloaded simultaneously. Default is `4 * cpu_count()`. For - example with `-j 1` DVC downloads one file at a time, with `-j 2` it downloads - two at a time, and so forth. For SSH remotes default is set to 4. + while downloading files from the remote. The effect is to control the number + of files downloaded simultaneously. Default is `4 * cpu_count()`. For example + with `-j 1` DVC downloads one file at a time, with `-j 2` it downloads two at + a time, and so forth. For SSH remotes default is set to 4.
- `-h`, `--help` - prints the usage/help message, and exit. @@ -118,8 +116,8 @@ r1 ssh://_username_@_host_/path/to/dvc/cache/directory > DVC supports several remote types. For details, see the > [`remote add`](/doc/commands-reference/remote/add) documentation. -With a remote cache containing some images and other files, we can pull all -changed files from the current Git branch: +Having some images and other files in remote storage, we can pull all changed +files from the current Git branch: ```dvc $ dvc pull --remote r1 diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 7ee6436c53..99ee1823fc 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -40,7 +40,7 @@ Under the hood a few actions are taken: exists, or not, in the remote simply by looking for it using the checksum. From this DVC gathers a list of files missing from the remote storage. -- Upload the cache files missing from the remote cache, if any, to the remote. +- Upload the cache files missing from remote storage, if any, to the remote. The DVC `push` command always works with a remote storage, and it is an error if none are specified on the command line nor in the configuration. If a @@ -50,18 +50,18 @@ and this [example](/doc/get-started/configure) for more information on how to configure a remote. With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads -only the files (or directories) that are new in the local repository to the -remote cache. It will not upload files associated with earlier versions or -branches of the project directory, nor will it upload files which -have not changed. +only the files (or directories) that are new in the local repository to remote +storage. It will not upload files associated with earlier versions or branches +of the project directory, nor will it upload files which have not +changed. 
The command `dvc status -c` can list files that are new in the local cache and are referenced in the workspace. It can be used to see what files `dvc push` would upload. -The `dvc status -c` command can show files which exist in the remote cache and -not exist in the local cache. Running `dvc push` from the local cache does not -remove nor modify those files in the remote cache. +The `dvc status -c` command can show files which exist in the remote but not in +the local cache. Running `dvc push` does not remove nor modify those files in +remote storage. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies @@ -71,12 +71,10 @@ to push. ## Options -- `-r REMOTE`, `--remote REMOTE` specifies which remote cache (see - `dvc remote list`) to push to. The value for `REMOTE` is a cache name defined - using the `dvc remote` command. If no `REMOTE` is given, or if no remote's are - defined in the project, an error message is printed. If the option is not - specified, then the default remote, configured with the `core.config` config - option, is used. +- `-r REMOTE`, `--remote REMOTE` specifies which remote to push to (see + `dvc remote list`). The value for `REMOTE` is a name defined using + `dvc remote`. If the option is not specified, then the default remote + (configured with the `core.remote` config option) is used. - `-a`, `--all-branches` - determines the files to upload by examining DVC-files in all branches of the project repository (if using Git). It's useful if @@ -96,10 +94,10 @@ to push. each target directory and its subdirectories for DVC-files to inspect. - `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously - while uploading files to the remote cache. The effect is to control the number - of files uploaded simultaneously. Default is `4 * cpu_count()`.
For example - with `-j 1` DVC uploads one file at a time, with `-j 2` it uploads two at a - time, and so forth. For SSH remotes default is set to 4. + while uploading files to the remote. The effect is to control the number of + files uploaded simultaneously. Default is `4 * cpu_count()`. For example with + `-j 1` DVC uploads one file at a time, with `-j 2` it uploads two at a time, + and so forth. For SSH remotes default is set to 4. - `-h`, `--help` - prints the usage/help message, and exit. @@ -164,7 +162,7 @@ Dvcfile ``` Imagine the local cache has been modified such that the data files in some of -these stages should be uploaded to the remote cache. +these stages should be uploaded to remote storage. ```dvc $ dvc status --cloud @@ -212,14 +210,14 @@ double check that all data had been uploaded. ## Example: What happens in the cache Let's take a detailed look at what happens to the DVC cache as you run an -experiment locally and push data to a remote cache. To set the example consider +experiment locally and push data to remote storage. To set the example consider having created a workspace that contains some code and data, and -having set up a remote cache. +having set up a remote. Some work has been performed in the local workspace, and it contains new data to -upload to the shared remote cache. When running `dvc status --cloud` the report -will list several files in `new` state. By looking in the cache directories we -can see exactly what that means. +upload to the shared remote. When running `dvc status --cloud` the report will +list several files in `new` state. By looking in the cache directories we can +see exactly what that means. ```dvc $ tree .dvc/cache @@ -262,16 +260,15 @@ $ tree ../vault/recursive ``` The directory `.dvc/cache` is the local cache, while `../vault/recursive` is the -remote cache. This listing clearly shows the local cache has more files in it -than the remote cache. 
Therefore `new` literally means that new files exist in -the local cache compared to the remote. +remote storage – a "local remote" in this case. This listing shows the local +cache having more files in it than the remote does (which is what `new` means). Next we can upload part of the data from the local cache to a remote using the command `dvc push --with-deps STAGE.dvc`. Remember that `--with-deps` searches backwards from the DVC-file `targets` to locate files to upload, and does not upload files in subsequent stages. -After doing that we can inspect the remote cache again: +After doing that we can inspect the remote storage again: ```dvc $ tree ../vault/recursive @@ -296,13 +293,13 @@ $ tree ../vault/recursive 8 directories, 8 files ``` -The remote cache now has some of the files which had been missing, but not all +The remote storage now has some of the files which had been missing, but not all of them. Indeed `dvc status --cloud` still lists a couple files as `new`. We can clearly see this in that a couple files are in the local cache and not in the -remote cache. +remote. -After running `dvc push` to cause all files to be uploaded the remote cache now -has all the files: +After running `dvc push` to cause all files to be uploaded, the remote storage +now contains all of them: ```dvc $ tree ../vault/recursive @@ -335,5 +332,5 @@ $ dvc status --cloud Data and pipelines are up to date. ``` -And running `dvc status --cloud` verifies that indeed there are no more files to -upload to the remote cache. +And running `dvc status --cloud`, DVC verifies that indeed there are no more +files to push to remote storage.
diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 51a30a8a7d..cc9e691612 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -3,7 +3,7 @@ Show changes in the project [pipelines](/doc/commands-reference/pipeline), as well as mismatches either between the local cache and local files, or between the cache and -remote cache. +remote storage. ## Synopsis @@ -21,14 +21,14 @@ positional arguments: `dvc status` searches for changes in the existing pipelines, either showing which [stages](/doc/commands-reference/run) have changed in the workspace and must be reproduced (with `dvc repro`), or differences -between local vs. remote cache (meaning `dvc push` or `dvc pull` -should be run to synchronize them). The two modes, _local_ and _cloud_ are -triggered by using the `--cloud` or `--remote` options: +between local cache vs. remote storage (meaning `dvc push` or `dvc pull` should +be run to synchronize them). The two modes, _local_ and _cloud_ are triggered by +using the `--cloud` or `--remote` options: | Mode | CLI Option | Description | | ------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------- | | local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the local cache (`.dvc/cache`) | -| remote | `--remote` | Comparisons are made between the local cache, and the given remote. Remote caches are defined using the `dvc remote` command. | +| remote | `--remote` | Comparisons are made between the local cache, and the given remote. Remote storage is defined using the `dvc remote` command. | | remote | `--cloud` | Comparisons are made between the local cache, and the default remote, defined with `dvc remote --default` command. | DVC determines data and code files to compare by analyzing all @@ -83,16 +83,16 @@ outputs described in it. 
the DVC-file is up to date, but there is no corresponding cache entry. -**For comparison against a remote cache:** +**For comparison against remote storage:** -- _new_ means the file exists in the local cache but not the remote cache -- _deleted_ means the file doesn't exist in the local cache, but exists in the - remote cache +- _new_ means the file exists in the local cache but not remote storage +- _deleted_ means the file doesn't exist in the local cache, but exists in + remote storage For either the _new_ and _deleted_ cases, the local cache (subset of it -determined by the current workspace) is different from the remote cache. -Bringing the two into sync requires `dvc pull` or `dvc push` to synchronize the -DVC cache. For the typical process to update the workspace, see +determined by the current workspace) is different from remote storage. Bringing +the two into sync requires `dvc pull` or `dvc push`. For the typical process to +update the workspace, see [Share Data And Model Files](/doc/use-cases/share-data-and-model-files). ## Options @@ -104,9 +104,9 @@ DVC cache. For the typical process to update the workspace, see will not show changes occurring in later stages than the `targets`. Applies whether or not `--cloud` is specified. -- `-c`, `--cloud` - enables comparison against a remote cache. If no `--remote` - option has been given, DVC will compare against the default remote cache, - which is specified in the `core.remote` config option. Otherwise the +- `-c`, `--cloud` - enables comparison against a remote. (See `dvc remote`.). If + no `--remote` option has been given, DVC will compare against the default + remote (specified in the `core.remote` config option). Otherwise the comparison will be against the remote specified in the `--remote` option. 
- `-r REMOTE`, `--remote REMOTE` - specifies which remote storage (see @@ -184,14 +184,14 @@ what files we have generated but haven't pushed to the remote yet: ```dvc $ dvc remote list -rcache s3://dvc-remote +storage s3://dvc-remote ``` And would like to check what files we have generated but haven't pushed to the remote yet: ```dvc -$ dvc status --remote rcache +$ dvc status --remote storage Preparing to collect status from s3://dvc-remote [##############################] 100% Collecting information @@ -201,5 +201,5 @@ Preparing to collect status from s3://dvc-remote new: data/matrix-test.p ``` -The output shows where the location of the remote cache as well as any -differences between the local cache and remote cache. +The output shows where the location of the remote storage is, as well as any +differences between the local cache and remote. From 182ff1be29cfec1def48b52937735c21713a4b7f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 1 Sep 2019 23:38:46 -0500 Subject: [PATCH 08/26] term: review usage of "DVC" branding (3) through static/docs/commands-reference/gc.md (using i/[^`]DVC / regex) for #448 --- src/Documentation/glossary.js | 2 +- static/docs/changelog/0.35.md | 4 +- static/docs/commands-reference/add.md | 42 +++++++++---------- static/docs/commands-reference/cache/index.md | 10 ++--- static/docs/commands-reference/checkout.md | 22 +++++----- static/docs/commands-reference/commit.md | 18 ++++---- static/docs/commands-reference/config.md | 12 +++--- static/docs/commands-reference/destroy.md | 29 +++++++------ static/docs/commands-reference/diff.md | 8 ++-- static/docs/commands-reference/pull.md | 6 +-- static/docs/commands-reference/remote/add.md | 2 +- static/docs/commands-reference/repro.md | 6 +-- static/docs/commands-reference/run.md | 8 ++-- .../docs/understanding-dvc/core-features.md | 2 +- static/docs/understanding-dvc/how-it-works.md | 4 +- .../understanding-dvc/related-technologies.md | 8 ++-- 
static/docs/understanding-dvc/what-is-dvc.md | 2 +- .../user-guide/dvc-files-and-directories.md | 2 +- .../user-guide/large-dataset-optimization.md | 7 ++-- 19 files changed, 97 insertions(+), 97 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index d9d12fee62..dedd061886 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -27,7 +27,7 @@ Initialized by running \`dvc init\` in the **workspace**. It will contain the }, { name: 'DVC Cache', - match: ['DVC cache', 'cache', 'cache directory', 'data cache', 'cached'], + match: ['DVC cache', 'cache', 'cache directory', 'cached'], desc: ` The DVC cache is a hidden storage (by default located in the \`.dvc/cache\` directory) for files that are under DVC control, and their different versions. diff --git a/static/docs/changelog/0.35.md b/static/docs/changelog/0.35.md index 9a852d0171..202c96fdd1 100644 --- a/static/docs/changelog/0.35.md +++ b/static/docs/changelog/0.35.md @@ -14,7 +14,7 @@ improvements) we have done in the last few months: - 📖 The [Get Started](/doc/get-started/agenda) section has been simplified (e.g. to use tags instead of branches) and extended. We have also prepared a - [Github DVC project](https://github.com/iterative/example-get-started) that + [DVC project on Github](https://github.com/iterative/example-get-started) that reflects the sequence of chapters in the “get started” guide. You can now download the whole project and reproduce all the models. @@ -63,7 +63,7 @@ improvements) we have done in the last few months: general user experience for the commands that navigate tags or branches (all the commands that include `--all-bracnhes`, `-a` or `--all-tags`, `-T`). -There are new [DVC integrations and plugins](/doc/user-guide/plugins) available: +There are new [integrations and plugins](/doc/user-guide/plugins) available: - Finally there is an official [Bash and Zsh completion](/doc/user-guide/autocomplete) for DVC! 
diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index bea3c4425a..48835b17f4 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -16,17 +16,17 @@ positional arguments: ## Description The `dvc add` command is analogous to the `git add` command. By default an added -file is committed to the DVC cache. Using the `--no-commit` option, the file -will not be added to the cache and instead the `dvc commit` command is used when -(or if) the file is to be committed to the DVC cache. +file is committed to the cache. Using the `--no-commit` option, the +file will not be added to the cache and instead the `dvc commit` command is used +when (or if) the file is to be committed to the cache. Under the hood, a few actions are taken for each file in `targets`: 1. Calculate the file checksum. -2. Move the file content to the DVC cache (default location is `.dvc/cache`). +2. Move the file content to the cache directory (by default in `.dvc/cache`). 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store - the MD5 checksum to identify the cache entry. + the MD5 checksum to identify the cached file. 5. Add the targets to `.gitignore` (if Git is used in this workspace) to prevent it from being committed to the Git repository. @@ -38,10 +38,10 @@ Under the hood, a few actions are taken for each file in `targets`: Unless the `-f` options is used, by default the DVC-file name generated is `.dvc`, where `` is file name of the first output (from `targets`). -The result is data file is added to the DVC cache, and DVC-files can be tracked -via Git or other version control system. The DVC-file lists the added file as an -output (`out`), and references the DVC cache entry using the checksum. See -[DVC-File Format](/doc/user-guide/dvc-file-format) for more details. 
+The result is data file is placed in the cache directory, and DVC-files can be +tracked via Git or other version control system. The DVC-file lists the added +file as an output (`out`), and references the cached file using the checksum. +See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details. > Note that DVC-files created by this command are _orphans_: they have no > dependencies. _Orphan_ "stage files" are always considered _changed_ by @@ -59,14 +59,14 @@ to work with directory hierarchies with `dvc add`. 1. With `dvc add --recursive`, the hierarchy is traversed and every file is added individually as described above. This means every file has its own - DVC-file, and a corresponding DVC cache entry is made (unless `--no-commit` - flag is added). + DVC-file, and a corresponding cached file is created (unless the + `--no-commit` flag is used). 2. When not using `--recursive` a DVC-file is created for the top of the directory (with default name `dirname.dvc`). Every file in the hierarchy is - added to the DVC cache (unless `--no-commit` flag is added), but DVC does not + added to the cache (unless `--no-commit` flag is added), but DVC does not produce individual DVC-files for each file in the directory tree. Instead, - the single DVC-file points to a file in the DVC cache that contains - references to the files in the added hierarchy. + the single DVC-file points to a file in the cache that contains references to + the files in the added hierarchy. In a DVC project, `dvc add` can be used to version control any data artifact (input, intermediate, or output files and @@ -84,10 +84,10 @@ and make your project reproducible. found, a new DVC-file is created using the process described in this command's description. -- `--no-commit` - do not put files/directories into cache. A DVC-file is - created, and an entry is added to `.dvc/state`, while nothing is added to the - cache. Use `dvc commit` when you are ready to save your results to cache. 
This - is analogous to using `git add` before `git commit`. +- `--no-commit` - do not save outputs to cache. A DVC-file is created, and an + entry is added to `.dvc/state`, while nothing is added to the cache. This is + analogous to using `git add` before `git commit`. Use `dvc commit` when ready + to commit the results to cache. > The `dvc status` command will mention that the file is `not in cache`. @@ -194,9 +194,9 @@ Saving information to 'pics.dvc'. ``` There are no DVC-files generated within this directory structure, but the images -are all added to the DVC cache. DVC prints a message to that effect, saying that -`md5` values are computed for each directory. A DVC-file is generated for the -top-level directory, and it contains this: +are all added to the cache. DVC prints a message to that effect, +saying that `md5` values are computed for each directory. A DVC-file is +generated for the top-level directory, and it contains this: ```yaml md5: df06d8d51e6483ed5a74d3979f8fe42e diff --git a/static/docs/commands-reference/cache/index.md b/static/docs/commands-reference/cache/index.md index b0bd01f166..e404eee547 100644 --- a/static/docs/commands-reference/cache/index.md +++ b/static/docs/commands-reference/cache/index.md @@ -15,12 +15,12 @@ positional arguments: ## Description -After DVC initialization, a hidden directory `.dvc/` is created with the -[DVC internal files](/doc/user-guide/dvc-files-and-directories), including the -default cache directory. +After DVC initialization, a hidden directory `.dvc/` is created to contain the +[DVC files and directories](/doc/user-guide/dvc-files-and-directories), +including the default cache directory. -The DVC cache is where your data files, models, etc (anything you want to -version with DVC) are actually stored. The corresponding files you see in the +The cache is where your data files, models, etc (anything you want to version +with DVC) are actually stored. 
The corresponding files you see in the workspace simply link to the ones in cache. (See `dvc config cache`, `type` config option, for more information on file links on different platforms.) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 0b149f2124..8704a808b0 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -16,14 +16,15 @@ positional arguments: ## Description -[DVC-files](/doc/user-guide/dvc-file-format) in a DVC project -specify which instance of each data file or directory is to be used, using the -checksum saved in the `outs` fields. The `dvc checkout` command updates the -workspace data to match with the cache files corresponding to those checksums. +[DVC-files](/doc/user-guide/dvc-file-format) in a project specify +which instance of each data file or directory is to be used, using the checksum +saved in the `outs` fields. The `dvc checkout` command updates the workspace +data to match with the cached files corresponding to those +checksums. Using an SCM like Git, the DVC-files are kept under version control. At a given branch or tag of the SCM repository, the DVC-files will contain checksums for -the corresponding data files kept in the DVC cache. After an SCM command like +the corresponding data files kept in the cache. After an SCM command like `git checkout` is run, the DVC-files will change to the state at the specified branch or commit or tag. Afterwards, the `dvc checkout` command is required in order to synchronize the data files with the currently checked out DVC-files. @@ -64,8 +65,8 @@ restoring any file size will be almost instantaneous. > `cache.slow_link_warning` config option to `false` with `dvc config cache`. The output of `dvc checkout` does not list which data files were restored. It -does report removed files and files that DVC was unable to restore due to it -missing from the cache. 
+does report removed files and files that DVC was unable to restore because +they're missing from the cache. This command will fail to checkout files that are missing from the cache. In such a case, `dvc checkout` prints a warning message. Any files that can be @@ -90,10 +91,9 @@ be pulled from remote storage using `dvc pull`. inspect. - `-f`, `--force` - does not prompt when removing workspace files. Changing the - current set of DVC-files with SCM commands like `git checkout` can result in - the need for DVC to remove files which should not exist in the current state - and are missing in the local cache (they are not committed in DVC terms). This - option controls whether the user will be asked to confirm these files removal. + current set of DVC-files with `git checkout` can result in the need for DVC to + remove files that don't match those DVC-file references or are missing in the + local cache. (They are not "committed", in DVC terms.) - `-h`, `--help` - shows the help message and exit. diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index aa943a17fa..5980335899 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -1,8 +1,8 @@ # commit Record changes to the repository by updating -[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to -cache. +[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to cache +directory. ## Synopsis @@ -49,15 +49,15 @@ DVC can't guarantee reproducibility in those cases – You commit any data you want. Let's take a look at what is happening in the fist scenario closely: Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data -to the DVC cache after creating a DVC-file. What _commit_ means is -that DVC: +to the cache after creating a DVC-file. 
What _commit_ means is that +DVC: - Computes a checksum for the file/directory - Enters the checksum and file name into the DVC-file - Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`) (Note that if the workspace was initialized with no SCM support (`dvc init --no-scm`), this does not happen.) -- Adds the file/directory or to the DVC cache +- Adds the file/directory or to the cache directory There are many cases where the last step is not desirable (for example rapid iterations on an experiment). The `--no-commit` option prevents the last step @@ -65,7 +65,7 @@ from occurring (on the commands where it's available), saving time and space by not storing unwanted data artifacts. Checksums is still computed and added to the DVC-file, but the actual data file is not saved in the DVC cache. This is where the `dvc commit` command comes into play. It performs that -last step: storing the file in the DVC cache. +last step: storing the file in the cache directory. ## Options @@ -135,7 +135,7 @@ the cache with undesired intermediate results, we can run a single stage with `dvc run --no-commit`, or reproduce an entire pipeline using `dvc repro --no-commit`. This prevents data from being pushed to cache. When development of the stage is finished, `dvc commit` can be used to store data -files in the DVC cache. +files in the cache directory. In the `featurize.dvc` stage, `src/featurize.py` is executed. A useful change to make is adjusting a parameter to `CountVectorizer` in that script. Namely, @@ -173,8 +173,8 @@ train.dvc: not in cache: model.pkl ``` -And we can look in the DVC cache to see if the new version of `model.pkl` is -indeed _not in cache_ as claimed. Look at `train.dvc` first: +Now we can look in the cache directory to see if the new version of `model.pkl` +is indeed _not in cache_ as claimed. 
Look at `train.dvc` first: ```yaml cmd: python src/train.py data/features model.pkl diff --git a/static/docs/commands-reference/config.md b/static/docs/commands-reference/config.md index c6b6fce4aa..ec6f97a46a 100644 --- a/static/docs/commands-reference/config.md +++ b/static/docs/commands-reference/config.md @@ -1,6 +1,6 @@ # config -Get or set repository or global DVC config options. +Get or set project-level (or global) DVC configuration options. ## Synopsis @@ -51,7 +51,7 @@ corresponding config file. ## Configuration sections These are the `name` parameters that can be used with `dvc config`, or the -sections in the DVC project config file (`.dvc/config`). +sections in the project config file (`.dvc/config`). ### core @@ -83,10 +83,10 @@ remote. See `dvc remote` for more information. ### cache -The DVC cache is a hidden storage (by default located in the `.dvc/cache` -directory) for files that are under DVC control, and their different versions. -(See `dvc cache` and -[DVC internal files](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +A DVC project cache is the hidden storage (by default located in +the `.dvc/cache` directory) for files that are under DVC control, and their +different versions. (See `dvc cache` and +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) for more details.) - `cache.dir` - set/unset cache directory location. A correct value must be diff --git a/static/docs/commands-reference/destroy.md b/static/docs/commands-reference/destroy.md index bd35dbe1fe..686266986f 100644 --- a/static/docs/commands-reference/destroy.md +++ b/static/docs/commands-reference/destroy.md @@ -1,8 +1,8 @@ # destroy Remove all -[DVC files and directories](/doc/user-guide/dvc-files-and-directories) from the -project. +[DVC files and directories](/doc/user-guide/dvc-files-and-directories) from a +DVC project. 
## Synopsis @@ -13,16 +13,17 @@ usage: dvc destroy [-h] [-q | -v] [-f] ## Description `dvc destroy` removes DVC-files, and the entire `.dvc/` meta directory from the -workspace. Note that the DVC cache will normally be -removed as well, unless it's set to an external location with `dvc cache dir`. -(By default a local cache is located in the `.dvc/cache` directory.) If you were -using [symlinks for linking data](/doc/user-guide/large-dataset-optimization) -from the cache, DVC will replace them with copies, so that your data is intact -after the DVC repository destruction. +workspace. Note that the cache directory will normally +be removed as well, unless it's set to an external location with +`dvc cache dir`. (By default a local cache is located in the `.dvc/cache` +directory.) If you were using +[symlinks for linking data](/doc/user-guide/large-dataset-optimization) from the +cache, DVC will replace them with copies, so that your data is intact after the +DVC repository destruction. ## Options -- `-f`, `--force` - do not prompt when destroying DVC project. +- `-f`, `--force` - do not prompt when destroying this project. - `-h`, `--help` - prints the usage/help message, and exit. @@ -42,8 +43,7 @@ $ ls -a .dvc .git code.py foo foo.dvc $ dvc destroy - -This will destroy all information about your pipelines, all data files, as well as cache in .dvc/cache. +This will destroy all information about your pipelines, all data files... Are you sure you want to continue? yes @@ -64,12 +64,11 @@ $ dvc cache dir /mnt/cache $ dvc add foo ``` -`dvc cache dir` changed the location of cache storage to external location. -Content of DVC repository: +`dvc cache dir` changed the location of the cache directory to an external +location. Content of workspace: ```dvc $ ls -a - .dvc .git code.py foo foo.dvc ``` @@ -87,7 +86,7 @@ Let's execute `dvc destroy`: ```dvc $ dvc destroy -This will destroy all information about your pipelines, all data files, as well as cache in .dvc/cache. 
+This will destroy all information about your pipelines, all data files... Are you sure you want to continue? [y/n] yes diff --git a/static/docs/commands-reference/diff.md b/static/docs/commands-reference/diff.md index faac084c0d..f0bee72ef4 100644 --- a/static/docs/commands-reference/diff.md +++ b/static/docs/commands-reference/diff.md @@ -1,10 +1,10 @@ # diff -Show changes between versions of the DVC repository. It can be narrowed down to -specific target files and directories under DVC control. +Show changes between versions of the DVC project. It can be +narrowed down to specific target files and directories under DVC control. -> This command requires the repository to be versioned with -> [Git](https://git-scm.com/). +> This command requires that the project is a [Git](https://git-scm.com/) +> repository. ## Synopsis diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 365bbac036..b462237afd 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -82,9 +82,9 @@ reflinks or hardlinks to put it in the workspace without copying. See each target directory and its subdirectories for DVC-files to inspect. - `-f`, `--force` - does not prompt when removing workspace files, which occurs - when these file no longer match the DVC-file references. This option surfaces - behavior from the `dvc fetch` and `dvc checkout` commands because `dvc pull` - in effect performs those 2 functions in a single command. + when these file no longer match the current DVC-file references. This option + surfaces behavior from the `dvc fetch` and `dvc checkout` commands because + `dvc pull` in effect performs those 2 functions in a single command. - `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously while downloading files from the remote. 
The effect is to control the number diff --git a/static/docs/commands-reference/remote/add.md b/static/docs/commands-reference/remote/add.md index 9e90aeebcd..3e2ae3b22e 100644 --- a/static/docs/commands-reference/remote/add.md +++ b/static/docs/commands-reference/remote/add.md @@ -74,7 +74,7 @@ Use `dvc config` to unset/change the default remote as so: using this remote by default to save or retrieve data files unless `-r` option is specified for them. -- `-f`, `--force` - to overwrite existing remote with new `url` value. +- `-f`, `--force` - overwrite existing remote with new `url` value. ## Examples diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 1f47c3cc4b..f182b827c4 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -63,9 +63,9 @@ specified), and updates stage files with the new checksum information. searching each target directory and its subdirectories for DVC-files to inspect. -- `--no-commit` - do not save outputs to cache. Useful when running different - experiments and you don't want to fill up the cache with temporary files. Use - `dvc commit` when ready to save results to cache. +- `--no-commit` - do not save outputs to cache. (See `dvc run`.) Useful when + running different experiments and you don't want to fill up the cache with + temporary files. Use `dvc commit` when ready to commit the results to cache. - `-m`, `--metrics` - show metrics after reproduction. The target pipelines must have at least one metrics file defined either with the `dvc metrics` command, diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index abbd014b67..b8a9424758 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -140,10 +140,10 @@ pipeline. default and deprecated. See `dvc remove` as well for more details. - `--no-commit` - do not save outputs to cache. 
A DVC-file is created, and an - entry is added to `.dvc/state`, while nothing is added to the cache. Use - `dvc commit` when you are ready to save your results to cache. Useful when - running different experiments and you don't want to fill up your cache with - temporary files. + entry is added to `.dvc/state`, while nothing is added to the cache. Useful + when running different experiments and you don't want to fill up your cache + with temporary files. Use `dvc commit` when ready to commit the results to + cache. > The `dvc status` command will mention that the file is `not in cache`. diff --git a/static/docs/understanding-dvc/core-features.md b/static/docs/understanding-dvc/core-features.md index eb49eaa9b1..667255b105 100644 --- a/static/docs/understanding-dvc/core-features.md +++ b/static/docs/understanding-dvc/core-features.md @@ -7,7 +7,7 @@ pipelines of DAGs. 3. **Large data file versioning** works by creating pointers in your Git - repository to the data cache on a local hard drive. + repository to the cache directory on a local hard drive. 4. **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML library agnostic: Keras, Tensorflow, PyTorch, scipy, etc. 
diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index a5f6dc6980..c3751dca9d 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -77,12 +77,12 @@ ```dvc $ git push - $ dvc push # push the data cache to the remote storage + $ dvc push # push from the cache to remote storage # On a colleague machine: $ git clone https://github.com/dataversioncontrol/myrepo.git $ cd myrepo - $ git pull # get the data cache from cloud + $ git pull # download the cache from remote storage $ dvc checkout # checkout data files $ ls -l data/ # You just got gigabytes of data through Git and DVC: diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 3e9035fa7d..94ff187c7a 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -34,10 +34,10 @@ process. - DVC doesn't need to run any services. No graphical user interface as a result, but we expect some GUI services will be created on top of DVC. -- DVC has transparent design: - [meta files and directories](/doc/user-guide/dvc-files-and-directories) - (including the data cache) have a human-readable format and can - be easily reused by external tools. +- DVC has transparent design. Its + [internal files and directories](/doc/user-guide/dvc-files-and-directories) + (including the cache directory) have a human-readable format and + can be easily reused by external tools. 4. **Git workflows** and Git usage methodologies such as Gitflow. 
The differences are: diff --git a/static/docs/understanding-dvc/what-is-dvc.md b/static/docs/understanding-dvc/what-is-dvc.md index b62c41c6a6..a903a2e1b6 100644 --- a/static/docs/understanding-dvc/what-is-dvc.md +++ b/static/docs/understanding-dvc/what-is-dvc.md @@ -45,7 +45,7 @@ DVC uses a few core concepts: [DVC-files](/doc/user-guide/dvc-file-format) describing that data are stored in Git for DVC needs (to maintain pipelines and reproducibility). -- **Data cache**: Directory with all data files on a local hard drive or in +- **Cache directory**: Directory with all data files on a local hard drive or in cloud storage, but not in the Git repository. - **Cloud storage** support: available complement to the core DVC features. This diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index b7199d24ff..41417fc02e 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -40,7 +40,7 @@ directory (`.dvc/`) with special internal files and directories: - `.dvc/updater.lock`: Lock file for `.dvc/updater` -- `.dvc/lock`: Lock file for the whole DVC project +- `.dvc/lock`: Lock file for the entire DVC project ## Structure of cache directory diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md index 70a7dd42ad..dae03bbec0 100644 --- a/static/docs/user-guide/large-dataset-optimization.md +++ b/static/docs/user-guide/large-dataset-optimization.md @@ -2,9 +2,10 @@ In order to track the data files and directories added with `dvc add` or `dvc run`, DVC moves all these files to a special cache directory. -The DVC cache is a hidden storage (by default located in `.dvc/cache`) for files -that are under DVC control, and their different versions. 
(See `dvc cache` and -[DVC internal files](/doc/user-guide/dvc-files-and-directories) for more +A DVC project cache is the hidden storage (by default located in +`.dvc/cache`) for files that are under DVC control, and their different +versions. (See `dvc cache` and +[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more details.) However, the versions of the tracked files that From 26d369296ce1c19c6078b9a8a64683a08e652b1b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 2 Sep 2019 00:54:35 -0500 Subject: [PATCH 09/26] term: most "local cache" -> "cache directory" / "project cache" for #448 --- static/docs/commands-reference/checkout.md | 10 ++-- static/docs/commands-reference/config.md | 6 +-- static/docs/commands-reference/fetch.md | 34 +++++++------- static/docs/commands-reference/init.md | 6 +-- static/docs/commands-reference/pull.md | 12 ++--- static/docs/commands-reference/push.md | 47 +++++++++---------- .../docs/commands-reference/remote/index.md | 4 +- static/docs/commands-reference/status.md | 42 ++++++++--------- static/docs/tutorial/index.md | 4 +- static/docs/tutorial/sharing-data.md | 12 ++--- static/docs/understanding-dvc/how-it-works.md | 10 ++-- ...ple-data-scientists-on-a-single-machine.md | 10 ++-- static/docs/user-guide/external-outputs.md | 7 +-- .../user-guide/large-dataset-optimization.md | 2 +- 14 files changed, 102 insertions(+), 104 deletions(-) diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 8704a808b0..04719c7400 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -93,7 +93,7 @@ be pulled from remote storage using `dvc pull`. - `-f`, `--force` - does not prompt when removing workspace files. Changing the current set of DVC-files with `git checkout` can result in the need for DVC to remove files that don't match those DVC-file references or are missing in the - local cache. 
(They are not "committed", in DVC terms.) + cache directory. (They are not "committed", in DVC terms.) - `-h`, `--help` - shows the help message and exit. @@ -205,10 +205,10 @@ MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43 ``` What happened is that DVC went through the sole existing DVC-file and adjusted -the current set of files to match the `outs` of that stage. `dvc fetch` command -runs once to download missing data from the remote storage to the local cache. -Alternatively, we could have just run `dvc pull` in this case to automatically -do `dvc fetch` + `dvc checkout`. +the current set of files to match the `outs` of that stage. `dvc fetch` runs +once to download missing data from the remote storage to the cache +directory. Alternatively, we could have just run `dvc pull` in this case +to automatically do `dvc fetch` + `dvc checkout`. ## Automating `dvc checkout` diff --git a/static/docs/commands-reference/config.md b/static/docs/commands-reference/config.md index ec6f97a46a..d0b760af85 100644 --- a/static/docs/commands-reference/config.md +++ b/static/docs/commands-reference/config.md @@ -137,9 +137,9 @@ for more details.) > These warnings are automatically turned off when `cache.type` is manually > set. -- `cache.local` - name of a local remote to use as local cache. This will - overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`. - Refer to `dvc remote` for more information on "local remotes". +- `cache.local` - name of a local remote to use as cache directory. (Refer to + `dvc remote` for more information on "local remotes".) This will overwrite the + value provided to `dvc config cache.dir` or `dvc cache dir`. - `cache.ssh` - name of an [SSH remote to use as external cache](/doc/user-guide/external-outputs#ssh). 
diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index f61b4f0d6f..fb0ec30d0d 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -1,8 +1,8 @@ # fetch Get files that are under DVC control from -[remote](/doc/commands-reference/remote#description) storage into the local -cache. +[remote](/doc/commands-reference/remote#description) storage into the +cache directory. ## Synopsis @@ -19,11 +19,11 @@ positional arguments: ## Description The `dvc fetch` command is a means to download files from remote storage into -the local cache, but without placing them in the workspace. This -makes the data files available for linking (or copying) into the workspace. +the cache directory, but without placing them in the workspace. +This makes the data files available for linking (or copying) into the workspace. (Refer to [dvc config cache.type](/doc/commands-reference/config#cache).) Along with `dvc checkout`, it's performed automatically by `dvc pull` when the target -[DVC-files](/doc/user-guide/dvc-file-format) are not already in the local cache: +[DVC-files](/doc/user-guide/dvc-file-format) are not already in the cache: ``` Controlled files Commands @@ -34,7 +34,7 @@ remote storage | +------------+ | - - - - | dvc fetch | ++ v +------------+ + +----------+ -local cache ++ | dvc pull | +cache directory ++ | dvc pull | + +------------+ + +----------+ | - - - - |dvc checkout| ++ | +------------+ @@ -44,7 +44,7 @@ local cache ++ | dvc pull | Fetching could be useful when first checking out an existing DVC project, since files under DVC control could already exist in remote -storage, but won't be in your local cache. (Refer to `dvc remote` for more +storage, but won't be in the project's cache. (Refer to `dvc remote` for more information on DVC remotes.) 
These necessary data or model files are listed as dependencies or outputs in a DVC-file (target [stage](/doc/commands-reference/run)) so they are required to @@ -54,7 +54,7 @@ dependencies or outputs in a DVC-file (target dependencies and outputs.) `dvc fetch` ensures that the files needed for a DVC-file to be -[reproduced](/doc/get-started/reproduce) exist in the local cache. If no +[reproduced](/doc/get-started/reproduce) exist in the cache directory. If no `targets` are specified, the set of data files to fetch is determined by analyzing all DVC-files in the current branch, unless `--all-branches` or `--all-tags` is specified. @@ -163,8 +163,8 @@ bigrams-experiment <- use bigrams to improve the model This project comes with a predefined HTTP [remote storage](/doc/commands-reference/remote). We can now just run -`dvc fetch` that will download the most recent `model.pkl`, `data.xml`, and -other files that are under DVC control into our local cache: +`dvc fetch` to download the most recent `model.pkl`, `data.xml`, and other files +that are under DVC control into our local cache. ```dvc $ dvc status --cloud @@ -191,14 +191,15 @@ $ tree .dvc ├── ... ``` -> `dvc status --cloud` (or `-c`) compares local cache vs default remote. +> `dvc status --cloud` (or `-c`) compares the cache directory vs. the default +> remote. As seen above, used without arguments, `dvc fetch` downloads all assets needed by all DVC-files in the current branch, including for directories. The checksums `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and `data/features/` directory, respectively. -Let's link files from local cache to the workspace with: +Let's now link files from the cache to the workspace with: ```dvc $ dvc checkout @@ -242,8 +243,7 @@ checksums shown above. After following the previous example (**Specific stages**), only the files associated with the `prepare.dvc` stage file have been fetched. 
Several -dependencies/outputs of other pipeline stages are still missing from local -cache: +dependencies/outputs of other pipeline stages are still missing from the cache: ```dvc $ dvc status -c @@ -289,12 +289,12 @@ $ tree .dvc/cache Fetching using `--with-deps` starts with the target DVC-file (stage) and searches backwards through its pipeline for data files to download into the -local cache. All the data for the second and third stages ("featurize" and +cache directory. All the data for the second and third stages ("featurize" and "train") has now been downloaded to cache. We could now use `dvc checkout` to get the data files needed to reproduce this pipeline up to the third stage into the workspace (with `dvc repro train.dvc`). > Note that in this sample project, the last stage file `evaluate.dvc` doesn't > add any more data files than those form previous stages so at this point all -> of the files for this pipeline are in local cache and `dvc status -c` would -> output `Pipelines are up to date.` +> of the files for this pipeline are in the project's cache and `dvc status -c` +> would output `Pipelines are up to date.` diff --git a/static/docs/commands-reference/init.md b/static/docs/commands-reference/init.md index 8b6cbd0beb..5fea5a5986 100644 --- a/static/docs/commands-reference/init.md +++ b/static/docs/commands-reference/init.md @@ -22,7 +22,7 @@ manipulated directly. [DVC directories](/doc/user-guide/dvc-files-and-directories). It will hold all the contents of tracked data files. Note that `.dvc/.gitignore` lists this directory, which means that the cache directory is not under Git control. This -is your local cache and you cannot push it to any Git remote. +is a local cache and you cannot `git push` it. ## Options @@ -30,8 +30,8 @@ is your local cache and you cannot push it to any Git remote. written. - `-f`, `--force` - remove `.dvc/` if it exists before initialization. Will - remove all local cache. 
Useful when first `dvc init` got corrupted for some - reason. + remove any existing local cache. Useful when a previous `dvc init` has been + corrupted. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index b462237afd..16e3017927 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -1,8 +1,8 @@ # pull Downloads missing files and directories from -[remote storage](/doc/commands-reference/remote) to the local cache -based on [DVC-files](/doc/user-guide/dvc-file-format) in the +[remote storage](/doc/commands-reference/remote) to the cache +directory based on [DVC-files](/doc/user-guide/dvc-file-format) in the workspace, then links the downloaded files into the workspace. ## Synopsis @@ -43,9 +43,9 @@ only the files (or directories) missing from the workspace by searching all versions or branches of the repository if using Git, nor will it download files which have not changed. -The command `dvc status -c` can list files that are missing in the local cache -but referenced in the current project DVC-files. It can be used to see what -files `dvc pull` would download. +The command `dvc status -c` can list files that are missing in the project's +cache, but referenced in its current DVC-files. It can be used to see what files +`dvc pull` would download. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies @@ -159,7 +159,7 @@ Dvcfile ``` Imagine the remote storage has been modified such that the data files in some of -these stages should be updated into the local cache. +these stages should be updated into the cache directory. 
```dvc $ dvc status --cloud diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 99ee1823fc..a2266b6048 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -36,9 +36,10 @@ Under the hood a few actions are taken: DVC-files to consult. - For each output referenced from each selected DVC-files, it finds a - corresponding entry in the local cache. DVC checks if the entry - exists, or not, in the remote simply by looking for it using the checksum. - From this DVC gathers a list of files missing from the remote storage. + corresponding entry in the cache directory. DVC checks if the + entry exists, or not, in the remote simply by looking for it using the + checksum. From this DVC gathers a list of files missing from the remote + storage. - Upload the cache files missing from remote storage, if any, to the remote. @@ -55,13 +56,9 @@ storage. It will not upload files associated with earlier versions or branches of the project directory, nor will it upload files which have not changed. -The command `dvc status -c` can list files that are new in the local cache and -are referenced in the workspace. It can be used to see what files -`dvc push` would upload. - -The `dvc status -c` command can show files which exist in the remote but not in -the local cache. Running `dvc push` does not remove nor modify those files in -remote storage. +The `dvc status -c` command can list files tracked by DVC that are new in the +cache directory (compared to the default remote.) It can be used to see what +files `dvc push` would upload. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies @@ -161,8 +158,8 @@ model.p.dvc Dvcfile ``` -Imagine the local cache has been modified such that the data files in some of -these stages should be uploaded to remote storage. 
+Imagine the project's cache has been modified such that the data files in some +of these stages should be uploaded to remote storage. ```dvc $ dvc status --cloud @@ -209,14 +206,14 @@ double check that all data had been uploaded. ## Example: What happens in the cache -Let's take a detailed look at what happens to the DVC cache as you run an -experiment locally and push data to remote storage. To set the example consider -having created a workspace that contains some code and data, and -having set up a remote. +Let's take a detailed look at what happens to the cache directory +as you run an experiment locally and push data to remote storage. To set the +example consider having created a workspace that contains some code +and data, and having set up a remote. Some work has been performed in the local workspace, and it contains new data to upload to the shared remote. When running `dvc status --cloud` the report will -list several files in `new` state. By looking in the cache directories we can +list several files in `new` state. By looking in the cached directories we can see exactly what that means. ```dvc @@ -260,13 +257,13 @@ $ tree ../vault/recursive ``` The directory `.dvc/cache` is the local cache, while `../vault/recursive` is the -remote storage – a "local remote" in this case. This listing shows the local -cache having more files in it than the remote does (which is what `new` means). +remote storage – a "local remote" in this case. This listing shows the cache +having more files in it than the remote does (which is what `new` means). -Next we can upload part of the data from the local cache to a remote using the -command `dvc push --with-deps STAGE.dvc`. Remember that `--with-deps` searches -backwards from the DVC-file `targets` to locate files to upload, and does not -upload files in subsequent stages. +Next we can upload part of the data from the cache directory to a remote using +the command `dvc push --with-deps STAGE.dvc`. 
Remember that `--with-deps` +searches backwards from the DVC-file `targets` to locate files to upload, and +does not upload files in subsequent stages. After doing that we can inspect the remote storage again: @@ -295,8 +292,8 @@ $ tree ../vault/recursive The remote storage now has some of the files which had been missing, but not all of them. Indeed `dvc status --cloud` still lists a couple files as `new`. We can -clearly see this in that a couple files are in the local cache and not in the -remote. +clearly see this in that a couple files are in the cache directory and not in +the remote. After running `dvc push` to cause all files to be uploaded, the remote storage now contains all of them: diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md index 07cd187c22..8b8348fef8 100644 --- a/static/docs/commands-reference/remote/index.md +++ b/static/docs/commands-reference/remote/index.md @@ -30,8 +30,8 @@ remotes provide a central place to keep and share data and model files. With a remote data storage, you can pull models and data files which were created by your team members without spending time and resources to build or process them locally. It also saves space on your local environment – DVC can -[fetch](/doc/commands-reference/fetch) into the local cache only the data you -need for a specific branch/commit. +[fetch](/doc/commands-reference/fetch) into the cache directory +only the data you need for a specific branch/commit. 
> If you installed DVC via `pip`, depending on the remote type you plan to use > you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`, diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index cc9e691612..159193d4bd 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -2,8 +2,8 @@ Show changes in the project [pipelines](/doc/commands-reference/pipeline), as well as mismatches either -between the local cache and local files, or between the cache and -remote storage. +between the cache directory and workspace files, or +between the cache and remote storage. ## Synopsis @@ -19,17 +19,17 @@ positional arguments: ## Description `dvc status` searches for changes in the existing pipelines, either showing -which [stages](/doc/commands-reference/run) have changed in the -workspace and must be reproduced (with `dvc repro`), or differences -between local cache vs. remote storage (meaning `dvc push` or `dvc pull` should -be run to synchronize them). The two modes, _local_ and _cloud_ are triggered by -using the `--cloud` or `--remote` options: - -| Mode | CLI Option | Description | -| ------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------- | -| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the local cache (`.dvc/cache`) | -| remote | `--remote` | Comparisons are made between the local cache, and the given remote. Remote storage is defined using the `dvc remote` command. | -| remote | `--cloud` | Comparisons are made between the local cache, and the default remote, defined with `dvc remote --default` command. | +which [stages](/doc/commands-reference/run) have changed in the workspace and +must be reproduced (with `dvc repro`), or differences between cache vs. 
remote +storage (meaning `dvc push` or `dvc pull` should be run to synchronize them). +The two modes, _local_ and _cloud_ are triggered by using the `--cloud` or +`--remote` options: + +| Mode | CLI Option | Description | +| ------ | ---------- | --------------------------------------------------------------------------------------------------------------------------- | +| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) | +| remote | `--remote` | Comparisons are made between the cache, and the given remote. Remote storage is defined using the `dvc remote` command. | +| remote | `--cloud` | Comparisons are made between the cache, and the default remote, defined with `dvc remote --default` command. | DVC determines data and code files to compare by analyzing all [DVC-files](/doc/user-guide/dvc-file-format) in the project @@ -85,14 +85,14 @@ outputs described in it. **For comparison against remote storage:** -- _new_ means the file exists in the local cache but not remote storage -- _deleted_ means the file doesn't exist in the local cache, but exists in - remote storage +- _new_ means that the file/directory exists in the cache directory but not in + remote storage. +- _deleted_ means that the file/directory doesn't exist in the cache, but exists + in remote storage. -For either the _new_ and _deleted_ cases, the local cache (subset of it -determined by the current workspace) is different from remote storage. Bringing -the two into sync requires `dvc pull` or `dvc push`. For the typical process to -update the workspace, see +For either _new_ and _deleted_ data, the cache (subset determined by the current +workspace) is different from remote storage. Bringing the two into sync requires +`dvc pull` or `dvc push`. For the typical process to update the workspace, see [Share Data And Model Files](/doc/use-cases/share-data-and-model-files). 
## Options @@ -202,4 +202,4 @@ Preparing to collect status from s3://dvc-remote ``` The output shows where the location of the remote storage is, as well as any -differences between the local cache and remote. +differences between the cache directory and remote. diff --git a/static/docs/tutorial/index.md b/static/docs/tutorial/index.md index 0b3722f7c1..eb80b091f6 100644 --- a/static/docs/tutorial/index.md +++ b/static/docs/tutorial/index.md @@ -25,7 +25,7 @@ and this approach will not require storing binary files in your Git repository. ## DVC Workflow -The diagram below describes all the DVC commands and relationships between local -cache and remote storage. +The diagram below describes all the DVC commands and relationships between a +local cache and remote storage. ![](/static/img/flow-large.png) diff --git a/static/docs/tutorial/sharing-data.md b/static/docs/tutorial/sharing-data.md index 0b1403004e..237321f2e7 100644 --- a/static/docs/tutorial/sharing-data.md +++ b/static/docs/tutorial/sharing-data.md @@ -7,13 +7,13 @@ repositories. These repositories will contain all the information needed for reproducibility and it might be a good idea to share these DVC-repositories using GitHub or other Git services. -DVC is able to push the cache to a cloud. +DVC is able to push the cache to cloud storage. -> Using your shared cache a colleague can reuse ML models that were trained on -> your machine. +> Using shared cloud storage, a colleague can reuse ML models that were trained +> on your machine. -First, you need to set a data remote which will be stored in the config file of -the project. This can be done using the CLI as shown below. +First, you need to set a remote storage which will be stored in the config file +of the project. This can be done using the CLI as shown below. 
> Note that we are using the `dvc-public` S3 bucket as an example and you don't > have write access to it, so in order to follow the tutorial you will need to @@ -28,7 +28,7 @@ $ git status -s M .dvc/config ``` -Then, a simple command pushes files from your local cache to the cloud: +Then, a simple command pushes files from your cache directory to the cloud: ```dvc $ dvc push diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index c3751dca9d..487afa48c8 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -48,8 +48,8 @@ ```dvc $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files - $ dvc checkout # checkout data files from the local cache (not Git) - $ ls -l data/ # These LARGE files were copied from DVC cache, not from Git + $ dvc checkout # checkout data files from the cache directory + $ ls -l data/ # These LARGE files came from the cache, not from Git total 1017488 -r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv @@ -72,17 +72,17 @@ Rscript plot.R result.csv plots.jpg ``` -7. DVC's local cache can be transferred to your colleagues and partners through +7. 
A DVC project's cache can be shared with your colleagues and partners through AWS S3, Azure Blob Storage or GCP Storage: ```dvc $ git push - $ dvc push # push from the cache to remote storage + $ dvc push # push from the cache directory to remote storage # On a colleague machine: $ git clone https://github.com/dataversioncontrol/myrepo.git $ cd myrepo - $ git pull # download the cache from remote storage + $ git pull # download tracked data from remote storage $ dvc checkout # checkout data files $ ls -l data/ # You just got gigabytes of data through Git and DVC: diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md index 85efd5c08a..634a455785 100644 --- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md +++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md @@ -30,11 +30,11 @@ permissions. ### Transfer existing cache (Optional) -This step is optional. You can skip it if you are setting up a new DVC -repository and don't have your local cache stored in `.dvc/cache`. If you did -work on your project with DVC previously and you wish to transfer your cache to -the shared cache directory (external to your workspace), you will need to simply -move it from an old cache location to the new one: +This step is optional. You can skip it if you are setting up a new DVC project +whose cache directory is not stored in the default location, `.dvc/cache`. 
If +you did work on your project with DVC previously and you wish to transfer your +cache to the shared cache directory (external to your workspace), you will need +to simply move it from an old cache location to the new one: ```dvc $ mv .dvc/cache/* /path/to/dvc-cache diff --git a/static/docs/user-guide/external-outputs.md b/static/docs/user-guide/external-outputs.md index a9a0c03654..23318bec29 100644 --- a/static/docs/user-guide/external-outputs.md +++ b/static/docs/user-guide/external-outputs.md @@ -31,7 +31,8 @@ pointing to your desired files. For cached external outputs (specified using `-o`) you will need to [setup an external cache](/doc/commands-reference/config#cache) location that will be used by DVC to store versions of your external file. Non-cached external -outputs (specified using `-O`) do not require external cache to be setup. +outputs (specified using `-O`) do not require an external cache to +be setup. > Avoid using the same remote location that you are using for `dvc push`, > `dvc pull`, `dvc fetch` as external cache for your external outputs, because @@ -50,8 +51,8 @@ stage file (DVC-file). ### Local -Your local cache location already defaults to `.dvc/cache`, so there is no need -to specify it explicitly. +The default local cache location is `.dvc/cache`, so there is no need to specify +it explicitly. ```dvc $ dvc add /home/shared/mydata diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md index dae03bbec0..db0f278d8c 100644 --- a/static/docs/user-guide/large-dataset-optimization.md +++ b/static/docs/user-guide/large-dataset-optimization.md @@ -2,7 +2,7 @@ In order to track the data files and directories added with `dvc add` or `dvc run`, DVC moves all these files to a special cache directory. 
-A DVC project cache is the hidden storage (by default located in +A DVC project's cache is the hidden storage (by default located in `.dvc/cache`) for files that are under DVC control, and their different versions. (See `dvc cache` and [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more From fa936465031b493ed15f61b6b5072acc99024deb Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 2 Sep 2019 01:13:49 -0500 Subject: [PATCH 10/26] term: data set -> dataset for #448 --- static/docs/changelog/0.18.md | 2 +- static/docs/understanding-dvc/collaboration-issues.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/changelog/0.18.md b/static/docs/changelog/0.18.md index c4eb5e545a..d18c01746a 100644 --- a/static/docs/changelog/0.18.md +++ b/static/docs/changelog/0.18.md @@ -17,7 +17,7 @@ really excited to share the progress with you: - ⚡ **DVC just got faster** - Data files management commands like `dvc add`, `dvc push`, `dvc pull`, etc. - got up to 10x faster on data sets with large number of files. + got up to 10x faster on datasets with large number of files. - Commands startup latency reduced 3x diff --git a/static/docs/understanding-dvc/collaboration-issues.md b/static/docs/understanding-dvc/collaboration-issues.md index cd316ca06a..c6f71c8045 100644 --- a/static/docs/understanding-dvc/collaboration-issues.md +++ b/static/docs/understanding-dvc/collaboration-issues.md @@ -29,8 +29,8 @@ principled way: - How do you recover a model from last week without wasting time waiting for the model to retrain? -- How do you quickly switch between the large data source and a small data - subset without modifying source code? +- How do you quickly switch between the large dataset and a small subset without + modifying source code? 4. Reproducibility. 
From e91df63c52b37e3beae652bfcb79fad73b4464fb Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 2 Sep 2019 01:23:29 -0500 Subject: [PATCH 11/26] term: "run(s)/ran again" -> "regenerate" (repro context) for #448 --- static/docs/commands-reference/import-url.md | 4 ++-- static/docs/commands-reference/install.md | 6 +++--- static/docs/commands-reference/repro.md | 18 +++++++++--------- static/docs/commands-reference/status.md | 2 +- static/docs/tutorial/define-ml-pipeline.md | 2 +- static/docs/tutorial/reproducibility.md | 3 ++- 6 files changed, 18 insertions(+), 17 deletions(-) diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 81d5fab644..81032c8f9c 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -326,7 +326,7 @@ Saving information to 'data.xml.dvc'. DVC has noticed the "external" data source has changed, and updated the import stage (reproduced it). In this case it's also necessary to run `dvc repro` so -that the rest of the pipeline is also run again. We can confirm so with: +that the rest of the pipeline is also regenerated. We can confirm so with: ```dvc $ dvc status @@ -348,6 +348,6 @@ $ dvc status Data and pipelines are up to date. ``` -`dvc repro` runs again the given stage `prepare.dvc`, noticing that its +`dvc repro` regenerates the given `prepare.dvc` stage, noticing that its dependency `data/data.xml` has changed. `dvc status` should report "Nothing to reproduce." after this. diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 79d268e4d3..36b870a57f 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -285,6 +285,6 @@ Data and pipelines are up to date. After reproducing this pipeline up to the "evaluate" stage, the data files are in sync with the code/config files, but we must now commit the changes to the -Git repository. 
Looking closely we see that `dvc status` is run again, informing -us that the data files are synchronized with the `Pipelines are up to date.` -message. +Git repository. Looking closely we see that `dvc status` is used again, +informing us that the data files are synchronized with the +`Pipelines are up to date.` message. diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index f182b827c4..0bc982cb9a 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -1,9 +1,9 @@ # repro -Run again commands recorded in the [stages](/doc/commands-reference/run) of one -or more [pipelines](/doc/commands-reference/pipeline), in the correct order. The -commands to be run are determined by recursively analyzing target stages and -changes in their dependencies. +Regenerate [stages](/doc/commands-reference/run) of one or more +[pipelines](/doc/commands-reference/pipeline) by executing commands recorded in +them again, in the correct order. The commands to be executed are determined by +recursively analyzing target stages and changes in their dependencies. ## Synopsis @@ -24,7 +24,7 @@ positional arguments: project. (A pipeline is typically defined using the `dvc run` command, while data input nodes are defined by the `dvc add` command.) -There's a few ways to restrict the stages that will be run again by this +There's a few ways to restrict the stages that will be regenerated by this command: by specifying stage file `targets`, or by using the `--single-item`, `--cwd`, or other options. @@ -92,13 +92,13 @@ specified), and updates stage files with the new checksum information. `requirements.txt`, we can specify it only once in `A`, omitting it in `B` and `C`. To be precise , it reproduces all descendants of a changed stage or the stages following the changed stage, even if their direct dependencies did not - change. 
Like with the same option on `dvc run`, this is a way to force stages - without changes to run again. This can also be useful for pipelines containing - stages that produce nondeterministic (semi-random) outputs. For + change. Like with the same option on `dvc run`, this is a way to force + regenerating stages without changes. This can also be useful for pipelines + containing stages that produce nondeterministic (semi-random) outputs. For nondeterministic stages the outputs can vary on each execution, meaning the cache cannot be trusted for such stages. -- `--downstream` - only run again the stages after the given `targets` in their +- `--downstream` - only regenerate the stages after the given `targets` in their corresponding pipelines, including the target stages themselves. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 159193d4bd..8c73f1a33c 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -50,7 +50,7 @@ Data and pipelines are up to date. ``` This indicates that no differences were detected, and therefore no stages would -be run again by `dvc repro`. +be regenerated by `dvc repro`. If instead, differences are detected, `dvc status` lists those changes. For each DVC-file (stage) with differences, the changes in _dependencies_ and/or diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 0463e49234..6ec7e8a017 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -398,4 +398,4 @@ focus is DVC, not ML modeling and we use a relatively small dataset without any advanced ML techniques. In the next chapter we will try to improve the metrics by changing our modeling -code and using reproducibility in our pipeline regeneration. +code and using reproducibility in our pipeline. 
diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md index b60fe34830..d62bbebd49 100644 --- a/static/docs/tutorial/reproducibility.md +++ b/static/docs/tutorial/reproducibility.md @@ -86,7 +86,8 @@ Reproducing 'Dvcfile': The process started with the feature creation stage because one of its parameters was changed — the edited source code file `code/featurization.py`. -All dependent stages were ran again as well. +All dependent stages were regenerated as well. (See `--downstream` option in +`dvc repro`.) Let’s take a look at the metric’s change. The improvement is close to zero (+0.0075% to be precise): From b022c85b8db48d32eac61b107f802c468d855838 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 2 Sep 2019 23:57:29 -0500 Subject: [PATCH 12/26] term: review usage of "dependency graph" (and related), "DAG", and eliminate term "chain of command" for #448 --- .../docs/commands-reference/pipeline/index.md | 23 ++++--- .../docs/commands-reference/pipeline/show.md | 2 +- static/docs/commands-reference/repro.md | 14 +++-- static/docs/commands-reference/run.md | 63 +++++++++---------- .../docs/get-started/connect-code-and-data.md | 5 +- static/docs/get-started/example-pipeline.md | 18 +++--- static/docs/get-started/pipeline.md | 2 +- static/docs/get-started/reproduce.md | 16 ++--- static/docs/tutorial/reproducibility.md | 10 +-- .../docs/understanding-dvc/core-features.md | 3 +- .../docs/understanding-dvc/existing-tools.md | 2 +- .../understanding-dvc/related-technologies.md | 24 ++++--- static/docs/understanding-dvc/what-is-dvc.md | 8 ++- 13 files changed, 101 insertions(+), 89 deletions(-) diff --git a/static/docs/commands-reference/pipeline/index.md b/static/docs/commands-reference/pipeline/index.md index 3eb03b8294..40c5bb6175 100644 --- a/static/docs/commands-reference/pipeline/index.md +++ b/static/docs/commands-reference/pipeline/index.md @@ -17,16 +17,19 @@ positional arguments: ## Description -A data pipeline, in 
general, is a chain of commands that process data files. It -produces intermediate data and a final result. For example, Machine Learning -(ML) pipelines typically start a with large raw datasets, include featurization -and training intermediate stages, and produce a final model, as well as certain -metrics. - -In DVC, pipeline stage files and commands, their data I/O, interdependencies, -and results (intermediate or final) are defined with `dvc add` and `dvc run`, -among other commands. This allows us to form one or more pipelines of stages -connected by their dependencies and outputs. +A data pipeline, in general, is a series of data processes (for example console +commands that take an input and produce an output). A pipeline may produce +intermediate data, and has a final result. Machine Learning (ML) pipelines +typically start a with large raw datasets, include intermediate featurization +and training stages, and produce a final model, as well as accuracy metrics. + +In DVC, pipeline stages and commands, their data I/O, interdependencies, and +results (intermediate or final) are defined with `dvc add` and `dvc run`, among +other commands. This allows DVC to restore one or more pipelines of stages +interconnected by their dependencies and outputs later. (See `dvc repro`.) + +> DVC builds a dependency graph +> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this. `dvc pipeline` commands help users display the existing project pipelines in different ways. 
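Illustrative note on the dependency graph (DAG) wording introduced in the `pipeline/index.md` hunk above: stages become connected when one stage's dependency is another stage's output, and a valid execution order is any topological order of that graph. The sketch below assumes Python 3.9+ (for the standard `graphlib` module) and uses hypothetical stage names, paths, and helper functions (`build_graph`, `execution_order`). It is not DVC's actual code.

```python
from graphlib import TopologicalSorter

# Hypothetical stages; names, deps, and outs are illustrative only.
stages = {
    "prepare.dvc": {"deps": ["data/data.xml"], "outs": ["data/prepared"]},
    "featurize.dvc": {"deps": ["data/prepared"], "outs": ["data/features"]},
    "train.dvc": {"deps": ["data/features"], "outs": ["model.pkl"]},
}


def build_graph(stages):
    """Map each stage to the set of stages whose outputs it depends on."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }


def execution_order(stages):
    """Return stage names so that every stage follows its upstream stages."""
    return list(TopologicalSorter(build_graph(stages)).static_order())
```

Here `data/data.xml` is produced by no stage, so `prepare.dvc` has no upstream stages and comes first. A dependency cycle would make `static_order()` raise `graphlib.CycleError`, which is why the graph has to be acyclic.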
diff --git a/static/docs/commands-reference/pipeline/show.md b/static/docs/commands-reference/pipeline/show.md index 6775de8178..8c68007f68 100644 --- a/static/docs/commands-reference/pipeline/show.md +++ b/static/docs/commands-reference/pipeline/show.md @@ -118,7 +118,7 @@ $ dvc pipeline show eval.txt.dvc --ascii `--------------' ``` -List dependencies recursively if graph have tree structure: +List dependencies recursively if the graph has a tree structure: ```dvc $ dvc pipeline show e.file.dvc --tree diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 0bc982cb9a..478d4e6db1 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -18,11 +18,15 @@ positional arguments: ## Description -`dvc repro` provides an interface to run the commands in a computational graph -(a.k.a. pipeline) again, as defined in the -[stage files](/doc/commands-reference/run) (DVC-files) found in the -project. (A pipeline is typically defined using the `dvc run` -command, while data input nodes are defined by the `dvc add` command.) +`dvc repro` provides an way to regenerate data pipelines, by restoring the +dependency graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) +implicitly defined by [stage files](/doc/commands-reference/run) (DVC-files with +dependencies) that are found in the project. The commands defined +in these stages can then be executed in the correct order, reproducing pipeline +results. + +> Pipeline stages are typically defined using the `dvc run` command, while +> initial data dependencies can be registered by the `dvc add` command. 
There's a few ways to restrict the stages that will be regenerated by this command: by specifying stage file `targets`, or by using the `--single-item`, diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index b8a9424758..8e64912532 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -18,38 +18,36 @@ positional arguments: ## Description -`dvc run` provides an interface to build a computational graph (a.k.a. -pipeline). It's a way to describe commands, data inputs and intermediate results -that go into creating a ML model (or other data results). By explicitly -specifying a list of dependencies (with `-d` option) and outputs (with `-o`, -`-O`, `-m`, or `-M` options) DVC can connect each individual stage (command) -into a directed acyclic graph (DAG). All the remainder of command-line input -provided to `dvc run` after the optional arguments (`-` or `--` dashed options) -will become the required `command` argument. - -> Remember to wrap the `command` with `"` quotes if there are special characters -> in it like `|` (pipe) or `<`, `>` (redirection) that would otherwise apply to -> the entire `dvc run` command. E.g. -> `dvc run -d script.sh "./script.sh > /dev/null 2>&1"` Use single quotes `'` -> instead of `"` to wrap the `command` if there are environment variables in it, -> that you want to be evaluated dynamically. E.g. -> `dvc run -d script.sh './myscript.sh $MYENVVAR'` +`dvc run` provides an interface to describe stages: individual commands and the +data inputs and outputs that go into creating a data result. By specifying a +list of dependencies (`-d` option) and outputs (`-o`, `-O`, `-m`, or `-M` +options) DVC can later connect each stage by building a dependency graph +([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). This graph is +used by DVC to restore a full data [pipeline](/doc/commands-reference/pipeline). 
+ +The remainder of command-line input provided to `dvc run` after the options (`-` +or `--` arguments) will become the required `command` argument. Please wrap the +`command` with `"` quotes if there are special characters in it like `|` (pipe) +or `<`, `>` (redirection) that would otherwise apply to the entire `dvc run` +command e.g. `dvc run -d script.sh "./script.sh > /dev/null 2>&1"`. Use single +quotes `'` instead of `"` to wrap the `command` if there are environment +variables in it, that you want to be evaluated dynamically. E.g. +`dvc run -d script.sh './myscript.sh $MYENVVAR'` Unless the `-f` options is used, by default the DVC-file name generated is `.dvc`, where `` is file name of the first output (`-o`, `-O`, `-m`, or `-M` option). If neither `-f`, nor outputs are specified, the stage name defaults to `Dvcfile`. -Since `dvc run` provides a way to build a graph of computations, using -dependencies and outputs to connect different stages it checks computational -graph integrity properties before creating a new stage. For example, for every -output there should be only one stage that explicitly specifies it. There should -be no cycles, etc. +Since `dvc run` provides a way to build a dependency graph using dependencies +and outputs to connect different stages, it checks the graph's integrity before +creating a new stage. For example, for every output there should be only one +stage that explicitly specifies it. There should be no cycles, etc. Note that `dvc repro` provides an interface to check state and reproduce this -graph later. This concept is similar to the one of the `Makefile` but DVC -captures data and caches data artifacts along the way. See this -[example](/doc/get-started/example-pipeline) to learn more and try to build a +graph (pipeline) later. This concept is similar to the one of the `Makefile` but +DVC captures data and caches data artifacts along the way. 
See this +[example](/doc/get-started/example-pipeline) to learn more and try to create a pipeline. ## Options @@ -60,20 +58,19 @@ pipeline. configuration file. DVC also supports certain [external dependencies](/doc/user-guide/external-dependencies). - DVC builds a computation graph and this list of dependencies is a way to - connect different stages with each other. When you run `dvc repro` to - reproduce a stage (or when a stage is reproduced due to recursive dependency), - the list of dependencies helps DVC analyze whether any dependencies have - changed and thus running the stage again is required. A special case is when - no dependencies are specified. + DVC builds a dependency graph connecting different stages with each other. + When you run `dvc repro` to reproduce a stage (or when a stage is reproduced + due to recursive dependency), the list of dependencies helps DVC analyze + whether any dependencies have changed and thus running the stage again is + required. A special case is when no dependencies are specified. > Note that a DVC-file without dependencies is considered always _changed_, so > `dvc repro` always executes it. - `-o`, `--outs` - specify a file or a directory that are results of running the command. Multiple outputs can be specified like this: - `-o model.pkl -o output.log`. DVC is building a computation graph and this - list of outputs (along with dependencies described above) is a way to connect + `-o model.pkl -o output.log`. DVC is building a dependency graph and this list + of outputs (along with dependencies described above) is a way to connect different stages with each other. DVC takes all output files and directories under its control and will put them into the cache (this is similar to what's happening when you run `dvc add`). @@ -119,7 +116,7 @@ pipeline. take dependencies or outputs under DVC control. In the DVC-file contents, the `md5` hash sums will be empty; They will be populated the next time this stage is actually executed. 
This command is useful, if for example, you need to - build a pipeline (computational graph) first, and then run it all at once. + build a pipeline (dependency graph) first, and then run it all at once. - `-y`, `--yes` - deprecated, use `--overwrite-dvcfile` instead. diff --git a/static/docs/get-started/connect-code-and-data.md b/static/docs/get-started/connect-code-and-data.md index d8a0ac2aaa..416a107ee8 100644 --- a/static/docs/get-started/connect-code-and-data.md +++ b/static/docs/get-started/connect-code-and-data.md @@ -118,9 +118,8 @@ wdir: . ``` > `dvc run` is just the first of a set of DVC command required to generate a -> [pipeline](/doc/get-started/pipeline) computational graph, or in other words, -> instructions on how to build a ML model (data file) from previous data files -> (or directories). +> [pipeline](/doc/get-started/pipeline), or in other words, instructions on how +> to build a ML model (data file) from previous data files (or directories). We would recommend to read a few next chapters first, before switching to other documents. Hopefully, `dvc run` and `dvc repro` will make more sense after diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md index 978ddb242f..38c854f824 100644 --- a/static/docs/get-started/example-pipeline.md +++ b/static/docs/get-started/example-pipeline.md @@ -168,9 +168,9 @@ is automatically added to the `.gitignore` file and a link is created into a cache `.dvc/cache/a3/04afb96060aad90176268345e10355` to save it. Two things are worth noticing here. First, by analyzing dependencies and outputs -that DVC-files describe, we can restore the full chain (DAG) of commands we need -to apply. This is important when you run `dvc repro` to reproduce the final or -intermediate result. +that DVC-files describe, we can restore the full series of commands (pipeline +stages) we need to apply. This is important when you run `dvc repro` to +reproduce the final or intermediate result. 
Second, you should see by now that the actual data is stored in the `.dvc/cache` directory, each file having a name in a form of an md5 hash. This cache is @@ -237,9 +237,9 @@ $ dvc run -d code/evaluate.py -d data/model.pkl -d data/matrix-test.pkl \ ### Expand to learn more about DVC internals -By analyzing dependencies and outputs in DVC-files, we can restore the full -chain of commands (DAG) we need to apply. This is important when you run -`dvc repro` to reproduce the final or intermediate result. +By analyzing dependencies and outputs in DVC-files, we can generate a dependency +graph: a series of commands DVC needs to execute. `dvc repro` does this in order +to restore a pipeline and reproduce its intermediate or final results. `dvc pipeline show` helps to visualize pipelines (run it with `-c` option to see actual commands instead of DVC-files): @@ -357,9 +357,9 @@ By wrapping your commands with `dvc run` it's easy to integrate DVC into your existing ML development pipeline/processes without any significant effort to rewrite your code. -The key step to notice is that DVC automatically derives the dependencies -between the experiment stages and builds the dependency graph (DAG) -transparently. +The key detail to notice is that DVC automatically derives the dependencies +between the defined stages by building dependency graphs that represent data +pipelines. Not only can DVC streamline your work into a single, reproducible environment, it also makes it easy to share this environment by Git including the diff --git a/static/docs/get-started/pipeline.md b/static/docs/get-started/pipeline.md index 0155cd1059..3d45608e38 100644 --- a/static/docs/get-started/pipeline.md +++ b/static/docs/get-started/pipeline.md @@ -5,7 +5,7 @@ difference between DVC and other version control tools that can handle large data files (e.g. `git lfs`). 
By using `dvc run` multiple times, and specifying outputs of a command (stage) as dependencies in another one, we can describe a sequence of commands that gets to a desired result. This is what we call a -**data pipeline** or computational graph. +**data pipeline** or dependency graph. Let's create a second stage (after `prepare.dvc`, created in the previous chapter) to perform feature extraction: diff --git a/static/docs/get-started/reproduce.md b/static/docs/get-started/reproduce.md index 642f3db9d3..1cf123b72f 100644 --- a/static/docs/get-started/reproduce.md +++ b/static/docs/get-started/reproduce.md @@ -1,11 +1,11 @@ # Reproduce -In the previous chapters, we described our first pipeline. Basically, we created -a number of [stage files](/doc/commands-reference/run). Each of these +In the previous chapters, we described our first +[pipeline]](/doc/commands-reference/pipeline). Basically, we created a number of +[stage files](/doc/commands-reference/run). Each of these [DVC-files](/doc/user-guide/dvc-file-format) describes single stage we need to -run towards a final result (a [pipeline]](/doc/commands-reference/pipeline)). -Each depends on some data (either raw data files or intermediate results from -previous stages) and code files. +run towards a final result (a pipeline). Each depends on some data (either raw +data files or intermediate results from previous stages) and code files. If you just cloned the [project](https://github.com/iterative/example-get-started), make sure you first @@ -31,9 +31,9 @@ that includes the data file in its outputs, get dependencies and commands, and so on. It means that DVC can recursively build a complete tree of commands it needs to execute to get the model file. -`dvc repro` is, essentially, building this execution graph, detects stages with -modified dependencies or missing outputs and recursively executes this graph -starting from these stages. 
+`dvc repro` essentially builds a dependency graph, detects stages with modified +dependencies or missing outputs and recursively executes commands (nodes in this +graph or pipeline) starting from the first stage with changes. Thus, `dvc run` and `dvc repro` provide a powerful framework for _reproducible experiments_ and _reproducible projects_. diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md index d62bbebd49..ce45f1ea2f 100644 --- a/static/docs/tutorial/reproducibility.md +++ b/static/docs/tutorial/reproducibility.md @@ -10,14 +10,14 @@ The most exciting part of DVC is reproducibility. DVC tracks all the dependencies, which helps you iterate on ML models faster without thinking what was affected by your last change. -> In order to track all the dependencies, DVC finds and reads ALL the DVC-files -> in a repository and builds a dependency graph -> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) based on these -> files. +In order to track all the dependencies, DVC finds and reads all the DVC-files in +a repository and builds a dependency graph +([pipeline](/doc/commands-reference/pipeline)) based on these files. This is one of the differences between DVC reproducibility and traditional Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was -designed in such a way to localize specification of DAG nodes. +designed in such a way to localize specification of the graph nodes (pipeline +[stages](/doc/commands-reference/run)). If you run `repro` on any [DVC-file](/doc/user-guide/dvc-file-format) from our repository, nothing happens because nothing was changed in the pipeline defined diff --git a/static/docs/understanding-dvc/core-features.md b/static/docs/understanding-dvc/core-features.md index 667255b105..9a730c1ae1 100644 --- a/static/docs/understanding-dvc/core-features.md +++ b/static/docs/understanding-dvc/core-features.md @@ -4,7 +4,8 @@ interface and Git workflow. 2. 
It makes data science projects **reproducible** by creating lightweight - pipelines of DAGs. + [pipelines](/doc/commands-reference/pipeline) using implicit dependency + graphs. 3. **Large data file versioning** works by creating pointers in your Git repository to the cache directory on a local hard drive. diff --git a/static/docs/understanding-dvc/existing-tools.md b/static/docs/understanding-dvc/existing-tools.md index 9d9b9eec97..370024af2d 100644 --- a/static/docs/understanding-dvc/existing-tools.md +++ b/static/docs/understanding-dvc/existing-tools.md @@ -6,7 +6,7 @@ There is one common opinion regarding data science tooling. Data scientists as engineers are supposed to use the best practices and collaboration software from software engineering. Source code version control system (Git), continuous integration services (CI), and unit test frameworks are all expected to be -utilized in data science pipelines. +utilized in data science [pipelines]](/doc/commands-reference/pipeline). But a comprehensive look at data science processes shows that the software engineering toolset does not cover data science needs. Try to answer all the diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index 94ff187c7a..fa3241db3e 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -13,8 +13,10 @@ process. should NOT be stored in a Git repository but still need to be tracked and versioned. -2. **Workflow management tools** (pipelines and DAGs): Airflow, Luigi, etc. The - differences are: +2. **Workflow management tools** ([pipelines]](/doc/commands-reference/pipeline) + and dependency graphs + ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))): Airflow, + Luigi, etc. The differences are: - DVC is focused on data science and modeling. As a result, DVC pipelines are lightweight, easy to create and modify. 
However, DVC lacks pipeline execution @@ -51,18 +53,22 @@ process. 5. **Makefile** (and it's analogues). The differences are: -- DVC utilizes a DAG: +- DVC utilizes a + [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph) + (DAG): - - The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with - file names `.dvc` or `Dvcfile`). + - The DAG or dependency graph is defined by the connections between + [DVC-file](/doc/user-guide/dvc-file-format) (with file names `.dvc` or + `Dvcfile`), based on their dependencies and outputs. - - One DVC-file defines one node in the DAG. All DVC-files in a repository make - up a single pipeline (think a single Makefile). All DVC-files (and + - Each DVC-file defines one node in the DAG. All DVC-files in a repository + make up a single pipeline (think a single Makefile). All DVC-files (and corresponding pipeline commands) are implicitly combined through their inputs and outputs, to simplify conflict resolving during merges. - - DVC provides a simple command `dvc run CMD` to generate a DVC-file - automatically based on the provided command, dependencies, and outputs. + - DVC provides a simple command `dvc run` to generate a DVC-file or "stage + file" automatically, based on the provided command, dependencies, and + outputs. - File tracking: diff --git a/static/docs/understanding-dvc/what-is-dvc.md b/static/docs/understanding-dvc/what-is-dvc.md index a903a2e1b6..9c08dae3ac 100644 --- a/static/docs/understanding-dvc/what-is-dvc.md +++ b/static/docs/understanding-dvc/what-is-dvc.md @@ -33,9 +33,11 @@ DVC uses a few core concepts: generates output files based on a set of input files and source code. This action usually changes experiment state. -- **Pipeline**: Directed acyclic graph (DAG) or chain of commands to reproduce - an experiment state. The commands are connected by input and output files. - Pipelines are defined by special **DVC-files** (which act like Makefiles). 
+- **Pipeline**: Dependency graph or series of commands to reproduce data + processing results. The commands are connected by input and output files + (dependencies). Pipelines are defined by special + [stage files](/doc/commands-reference/run) (similar to Makefiles). Refer to + [pipeline]](/doc/commands-reference/pipeline) for more information. - **Workflow**: Set of experiments and relationships among them. Workflow corresponds to the entire Git repository. From 16658125196d95a787ca0c11da56774c30475136 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 3 Sep 2019 00:03:01 -0500 Subject: [PATCH 13/26] cmd ref: update "Data and pipelines are up to date." phrase per https://github.com/iterative/dvc.org/pull/601#pullrequestreview-282710389 --- static/docs/commands-reference/fetch.md | 2 +- static/docs/commands-reference/install.md | 2 +- static/docs/commands-reference/status.md | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index fb0ec30d0d..e0cab2f6a3 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -297,4 +297,4 @@ the workspace (with `dvc repro train.dvc`). > Note that in this sample project, the last stage file `evaluate.dvc` doesn't > add any more data files than those form previous stages so at this point all > of the files for this pipeline are in the project's cache and `dvc status -c` -> would output `Pipelines are up to date.` +> would output `Data and pipelines are up to date.` diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index 36b870a57f..dbc0521873 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -287,4 +287,4 @@ After reproducing this pipeline up to the "evaluate" stage, the data files are in sync with the code/config files, but we must now commit the changes to the Git repository. 
Looking closely we see that `dvc status` is used again, informing us that the data files are synchronized with the -`Pipelines are up to date.` message. +`Data and pipelines are up to date.` message. diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 8c73f1a33c..bf0546c739 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -130,7 +130,7 @@ workspace) is different from remote storage. Bringing the two into sync requires - `-h`, `--help` - prints the usage/help message, and exit. - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if - Pipelines are up to date, otherwise 1. + data and pipelines are up to date, otherwise 1. - `-v`, `--verbose` - displays detailed tracing information. From 492cfc6a64c84d893ccd3c93a70ebc45f20b2078 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 3 Sep 2019 00:47:20 -0500 Subject: [PATCH 14/26] term: improve usage of "regenreate" and "execute" for stages/pipelines and their outputs per https://github.com/iterative/dvc.org/pull/601#issuecomment-527251667 --- static/docs/commands-reference/import-url.md | 9 +++-- static/docs/commands-reference/lock.md | 4 +- static/docs/commands-reference/repro.md | 40 +++++++++---------- static/docs/commands-reference/run.md | 20 +++++----- static/docs/commands-reference/status.md | 2 +- .../docs/get-started/connect-code-and-data.md | 3 +- static/docs/get-started/reproduce.md | 15 +++---- static/docs/tutorial/define-ml-pipeline.md | 2 +- static/docs/tutorial/reproducibility.md | 6 +-- static/docs/user-guide/dvc-file-format.md | 2 +- static/docs/user-guide/external-outputs.md | 8 ++-- static/docs/user-guide/update-tracked-file.md | 2 +- 12 files changed, 57 insertions(+), 56 deletions(-) diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 81032c8f9c..77f5ec18ad 100644 --- 
a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -326,7 +326,8 @@ Saving information to 'data.xml.dvc'. DVC has noticed the "external" data source has changed, and updated the import stage (reproduced it). In this case it's also necessary to run `dvc repro` so -that the rest of the pipeline is also regenerated. We can confirm so with: +that the rest of the pipeline results are also regenerated. We can confirm so +with: ```dvc $ dvc status @@ -348,6 +349,6 @@ $ dvc status Data and pipelines are up to date. ``` -`dvc repro` regenerates the given `prepare.dvc` stage, noticing that its -dependency `data/data.xml` has changed. `dvc status` should report "Nothing to -reproduce." after this. +`dvc repro` executes the command defined in the given `prepare.dvc` stage after +noticing that its dependency `data/data.xml` has changed. `dvc status` should +report "Nothing to reproduce." after this. diff --git a/static/docs/commands-reference/lock.md b/static/docs/commands-reference/lock.md index efa457421b..084d3579ab 100644 --- a/static/docs/commands-reference/lock.md +++ b/static/docs/commands-reference/lock.md @@ -4,8 +4,8 @@ Lock a [DVC-file](/doc/user-guide/dvc-file-format) ([stage](/doc/commands-reference/run)). Use `dvc unlock` to unlock the file. If a DVC-file is locked, the stage is considered unchanged. `dvc repro` will not -run commands to rebuild outputs of locked stages, even if some dependencies have -changed and even if `--force` is provided. +execute commands to regenerate outputs of locked stages, even if some +dependencies have changed and even if `--force` is provided. 
## Synopsis diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 478d4e6db1..d046cd4262 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -1,9 +1,9 @@ # repro -Regenerate [stages](/doc/commands-reference/run) of one or more -[pipelines](/doc/commands-reference/pipeline) by executing commands recorded in -them again, in the correct order. The commands to be executed are determined by -recursively analyzing target stages and changes in their dependencies. +Reproduce complete or partial [pipelines](/doc/commands-reference/pipeline) by +executing commands defined in their [stages](/doc/commands-reference/run), in +the correct order. The commands to be executed are determined by recursively +analyzing dependencies and outputs of the target stages. ## Synopsis @@ -18,12 +18,12 @@ positional arguments: ## Description -`dvc repro` provides an way to regenerate data pipelines, by restoring the -dependency graph (a [DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) -implicitly defined by [stage files](/doc/commands-reference/run) (DVC-files with -dependencies) that are found in the project. The commands defined -in these stages can then be executed in the correct order, reproducing pipeline -results. +`dvc repro` provides an way to regenerate data pipeline results, by restoring +the dependency graph (a +[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) implicitly defined +by [stage files](/doc/commands-reference/run) (DVC-files with dependencies) that +are found in the project. The commands defined in these stages can +then be executed in the correct order, reproducing pipeline results. > Pipeline stages are typically defined using the `dvc run` command, while > initial data dependencies can be registered by the `dvc add` command. @@ -47,11 +47,11 @@ specified), and updates stage files with the new checksum information. 
## Options - `-f`, `--force` - reproduce a pipeline, regenerating its results, even if no - changes were found. By default this runs all of its stages but it can be + changes were found. By default this executes all of its stages but it can be limited with the `targets` argument and `-s`, `-p`, or `-c` options. - `-s`, `--single-item` - reproduce only a single stage by turning off the - recursive search for changed dependencies. Multiple stages are run + recursive search for changed dependencies. Multiple stages are executed (non-recursively) if multiple stage files are given as `targets`. - `-c`, `--cwd` - directory within the project to reproduce from. If no @@ -79,7 +79,7 @@ specified), and updates stage files with the new checksum information. executing the commands. - `-i`, `--interactive` - ask for confirmation before reproducing each stage. - The stage is only run if the user types "y". + The stage is only executed if the user types "y". - `-p`, `--pipeline` - reproduce the entire pipelines that the stage file `targets` belong to. Use `dvc pipeline show .dvc` to show the parent @@ -96,21 +96,21 @@ specified), and updates stage files with the new checksum information. `requirements.txt`, we can specify it only once in `A`, omitting it in `B` and `C`. To be precise , it reproduces all descendants of a changed stage or the stages following the changed stage, even if their direct dependencies did not - change. Like with the same option on `dvc run`, this is a way to force - regenerating stages without changes. This can also be useful for pipelines - containing stages that produce nondeterministic (semi-random) outputs. For + change. Like with the same option on `dvc run`, this is a way to force execute + stages without changes. This can also be useful for pipelines containing + stages that produce nondeterministic (semi-random) outputs. For nondeterministic stages the outputs can vary on each execution, meaning the cache cannot be trusted for such stages. 
-- `--downstream` - only regenerate the stages after the given `targets` in their +- `--downstream` - only execute the stages after the given `targets` in their corresponding pipelines, including the target stages themselves. - `-h`, `--help` - prints the usage/help message, and exit. - `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if all - stages are up to date or if all stages are successfully run, otherwise exit - with 1. The command run by the stage is free to make output irregardless of - this flag. + stages are up to date or if all stages are successfully executed, otherwise + exit with 1. The command defined in the stage is free to write output + irregardless of this flag. - `-v`, `--verbose` - displays detailed tracing information. diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index 8e64912532..7df1374087 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -59,10 +59,9 @@ pipeline. [external dependencies](/doc/user-guide/external-dependencies). DVC builds a dependency graph connecting different stages with each other. - When you run `dvc repro` to reproduce a stage (or when a stage is reproduced - due to recursive dependency), the list of dependencies helps DVC analyze - whether any dependencies have changed and thus running the stage again is - required. A special case is when no dependencies are specified. + When you run `dvc repro`, the list of dependencies helps DVC analyze whether + any dependencies have changed and thus executing stages as required to + regenerate their output. A special case is when no dependencies are specified. > Note that a DVC-file without dependencies is considered always _changed_, so > `dvc repro` always executes it. @@ -112,11 +111,12 @@ pipeline. is used by `dvc repro` to change the working directory before running the command. 
-- `--no-exec` - create a stage file, but do not run the command specified nor - take dependencies or outputs under DVC control. In the DVC-file contents, the - `md5` hash sums will be empty; They will be populated the next time this stage - is actually executed. This command is useful, if for example, you need to - build a pipeline (dependency graph) first, and then run it all at once. +- `--no-exec` - create a stage file, but do not execute the command defined in + it, nor take dependencies or outputs under DVC control. In the DVC-file + contents, the `md5` hash sums will be empty; They will be populated the next + time this stage is actually executed. This command is useful, if for example, + you need to build a pipeline (dependency graph) first, and then run it all at + once. - `-y`, `--yes` - deprecated, use `--overwrite-dvcfile` instead. @@ -132,7 +132,7 @@ pipeline. some reason (meaning it produces different outputs from the same list of inputs). -- `--remove-outs` - it removes stage outputs before running the command. If +- `--remove-outs` - it removes stage outputs before executing the command. If `--no-exec` specified outputs are removed anyway. This option is enabled by default and deprecated. See `dvc remove` as well for more details. diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index bf0546c739..984cc50819 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -50,7 +50,7 @@ Data and pipelines are up to date. ``` This indicates that no differences were detected, and therefore no stages would -be regenerated by `dvc repro`. +be executed by `dvc repro`. If instead, differences are detected, `dvc status` lists those changes. 
For each DVC-file (stage) with differences, the changes in _dependencies_ and/or diff --git a/static/docs/get-started/connect-code-and-data.md b/static/docs/get-started/connect-code-and-data.md index 416a107ee8..c070d5ee98 100644 --- a/static/docs/get-started/connect-code-and-data.md +++ b/static/docs/get-started/connect-code-and-data.md @@ -135,7 +135,8 @@ readable. `-d src/prepare.py` and `-d data/data.xml` mean that the `prepare.dvc` stage file depends on them to produce the result. When you run `dvc repro` next time (see next chapter) DVC will automatically check these dependencies and decide -whether this stage is up to date or or whether it requires rebuilding. +whether this stage is up to date or or whether it should be executed to +regenerate its outputs. `-o data/prepared` specifies the output directory processed data will be put into. The script creates two files in it – that will be used later to generate diff --git a/static/docs/get-started/reproduce.md b/static/docs/get-started/reproduce.md index 1cf123b72f..064a173f9d 100644 --- a/static/docs/get-started/reproduce.md +++ b/static/docs/get-started/reproduce.md @@ -1,11 +1,12 @@ # Reproduce In the previous chapters, we described our first -[pipeline]](/doc/commands-reference/pipeline). Basically, we created a number of -[stage files](/doc/commands-reference/run). Each of these -[DVC-files](/doc/user-guide/dvc-file-format) describes single stage we need to -run towards a final result (a pipeline). Each depends on some data (either raw -data files or intermediate results from previous stages) and code files. +[pipeline]](/doc/commands-reference/pipeline). Basically, we generated a number +of [stage files](/doc/commands-reference/run) +([DVC-files](/doc/user-guide/dvc-file-format)). Each of these stages define +single commands to execute towards a final result. Each depends on some data +(either raw data files or intermediate results from previous stages) and code +files. 
If you just cloned the [project](https://github.com/iterative/example-get-started), make sure you first @@ -19,8 +20,8 @@ $ dvc repro train.dvc ``` > If you've just followed the previous chapters, the command above will have -> nothing to reproduce since you've already run all the pipeline stages. To -> easily try this command, clone this example +> nothing to reproduce since you've recently executed all the pipeline stages. +> To easily try this command, clone this example > [Github project](https://github.com/iterative/example-get-started) and run it > from there. diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 6ec7e8a017..0a47e8022c 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -266,7 +266,7 @@ A single stage of our ML pipeline was defined and committed into repository. It isn't necessary to commit stages right after their creation. You can create a few and commit them to Git together later. -Let’s run the following stages: converting an XML file to TSV, and then +Let’s create the following stages: converting an XML file to TSV, and then separating training and testing datasets: ```dvc diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md index ce45f1ea2f..8a07fac689 100644 --- a/static/docs/tutorial/reproducibility.md +++ b/static/docs/tutorial/reproducibility.md @@ -86,8 +86,7 @@ Reproducing 'Dvcfile': The process started with the feature creation stage because one of its parameters was changed β€” the edited source code file `code/featurization.py`. -All dependent stages were regenerated as well. (See `--downstream` option in -`dvc repro`.) +All dependent stages were executed as well. Let’s take a look at the metric’s change. 
The improvement is close to zero (+0.0075% to be precise): @@ -182,8 +181,7 @@ clf = RandomForestClassifier(n_estimators=700, n_jobs=6, random_state=seed) ``` -Only the modeling and the evaluation stage need to be reproduced. Just run -repro: +Only the modeling and the evaluation stage need to be reproduced. Just run: ```dvc $ dvc repro diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md index 1f488525fb..ba60116aec 100644 --- a/static/docs/user-guide/dvc-file-format.md +++ b/static/docs/user-guide/dvc-file-format.md @@ -45,7 +45,7 @@ meta: # Special key to contain arbitary user data On the top level, `.dvc` file consists of these fields: -- `cmd`: Command that is being run in this stage +- `cmd`: Executable command defined in this stage - `deps`: List of dependencies for this stage - `outs`: List of outputs for this stage - `md5`: md5 checksum for this DVC-file diff --git a/static/docs/user-guide/external-outputs.md b/static/docs/user-guide/external-outputs.md index 23318bec29..92cfe38114 100644 --- a/static/docs/user-guide/external-outputs.md +++ b/static/docs/user-guide/external-outputs.md @@ -73,7 +73,7 @@ $ dvc config cache.s3 s3cache # Add data on S3 directly $ dvc add s3://mybucket/mydata -# Run the stage with external S3 output +# Create the stage with external S3 output $ dvc run -d data.txt \ -o s3://mybucket/data.txt \ aws s3 cp data.txt s3://mybucket/data.txt @@ -91,7 +91,7 @@ $ dvc config cache.gs gscache # Add data on GS directly $ dvc add gs://mybucket/mydata -# Run the stage with external GS output +# Create the stage with external GS output $ dvc run -d data.txt \ -o gs://mybucket/data.txt \ gsutil cp data.txt gs://mybucket/data.txt @@ -109,7 +109,7 @@ $ dvc config cache.ssh sshcache # Add data on SSH directly $ dvc add ssh://user@example.com:/mydata -# Run the stage with external SSH output +# Create the stage with external SSH output $ dvc run -d data.txt \ -o 
ssh://user@example.com:/home/shared/data.txt \ scp data.txt user@example.com:/home/shared/data.txt @@ -127,7 +127,7 @@ $ dvc config cache.hdfs hdfscache # Add data on HDFS directly $ dvc add hdfs://user@example.com/mydata -# Run the stage with external HDFS output +# Create the stage with external HDFS output $ dvc run -d data.txt \ -o hdfs://user@example.com/home/shared/data.txt \ hdfs fs -copyFromLocal \ diff --git a/static/docs/user-guide/update-tracked-file.md b/static/docs/user-guide/update-tracked-file.md index 09b63a9de6..10ca5ebc9e 100644 --- a/static/docs/user-guide/update-tracked-file.md +++ b/static/docs/user-guide/update-tracked-file.md @@ -16,7 +16,7 @@ may mean either replacing `train.tsv` with a new file having the same name or editing the content of the file. If you run `dvc repro` there is no need to manage generated (output) files -manually, DVC removes them for you before running the stage which generates +manually, DVC removes them for you before executing the stage which generates them. If you use DVC to track a file that is generated during your pipeline (e.g. some From d81791d63e307f00495e0928a0705a58ceb74c3e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 3 Sep 2019 17:12:04 -0500 Subject: [PATCH 15/26] term: reduse usage of "again", especially in the contest of `dvc repro` for #448 --- static/docs/commands-reference/repro.md | 2 +- static/docs/commands-reference/run.md | 11 +++++------ static/docs/get-started/example-versioning.md | 8 ++++---- static/docs/get-started/metrics.md | 2 +- 4 files changed, 11 insertions(+), 12 deletions(-) diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index d046cd4262..0e99a9921a 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -37,7 +37,7 @@ omitted, `Dvcfile` will be assumed. 
By default, this command recursively searches in pipeline stages, starting from the `targets`, to determine which ones have changed. Then it executes the -corresponding commands again. +corresponding commands. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. It saves all the data files, intermediate diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index 7df1374087..a42a6f895d 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -125,12 +125,11 @@ pipeline. for confirmation. - `--ignore-build-cache` - if an exactly equal DVC-file exists (same list of - outputs and inputs, the same command to run) which has been already executed, - and is up to date, with option `dvc run` won't execute the command again by - default (thus "build cache"). This option gives a way to forcefully run the - command anyway. It's useful if the command is considered non-deterministic for - some reason (meaning it produces different outputs from the same list of - inputs). + outputs and inputs, the same command to run which has been already executed), + and is up to date, `dvc run` won't normally execute the command again (thus + "build cache"). This option gives a way to forcefully execute the command + anyway. It's useful if the command is non-deterministic (meaning it produces + different outputs from the same list of inputs). - `--remove-outs` - it removes stage outputs before executing the command. If `--no-exec` specified outputs are removed anyway. 
This option is enabled by diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md index 496070e122..27023998f3 100644 --- a/static/docs/get-started/example-versioning.md +++ b/static/docs/get-started/example-versioning.md @@ -16,9 +16,9 @@ this example is to give you some hands-on experience with a very basic scenario ![](/static/img/cats-and-dogs.jpg) We first train a classifier model using 1000 labeled images, then we double the -number and run the training again. We capture both datasets and both results and -show how to use `dvc checkout` along with `git checkout` to switch between -different versions. +number and retrain our model. We capture both datasets and both results and show +how to use `dvc checkout` along with `git checkout` to switch between different +versions. The specific algorithm that is used to train and validate the classifier is not important. No prior knowledge is required about Keras. We reuse the @@ -207,7 +207,7 @@ data └── cat.1400.jpg ``` -Of course, we want to leverage these new labels and train the model again. +Of course, we want to leverage these new labels and retrain the model. ```dvc $ dvc add data diff --git a/static/docs/get-started/metrics.md b/static/docs/get-started/metrics.md index 9bc690bc3e..00828c4b1d 100644 --- a/static/docs/get-started/metrics.md +++ b/static/docs/get-started/metrics.md @@ -22,7 +22,7 @@ with a single number inside. > Please, refer to the `dvc metrics` command documentation to see more available > options and details. 
-Let's again commit and save results: +Let's save the updated results: ```dvc $ git add evaluate.dvc auc.metric From 65fbec32a741c3b929017470ecdb68a44f5610b3 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 3 Sep 2019 23:33:14 -0500 Subject: [PATCH 16/26] glossary: update "workspace" term, and improve related user-guide descriptions related to https://github.com/iterative/dvc/issues/2455#issuecomment-527295190 --- src/Documentation/glossary.js | 6 ++++-- static/docs/commands-reference/add.md | 19 +++++++++++-------- .../docs/user-guide/external-dependencies.md | 13 +++++++------ static/docs/user-guide/external-outputs.md | 11 ++++++----- 4 files changed, 28 insertions(+), 21 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index dedd061886..4ab94d21cf 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -11,8 +11,10 @@ export default { Directory containing all your project files. For example raw datasets, source code, ML models, etc. A workspace becomes a **DVC project** when [\`dvc init\`](/doc/commands-reference/init) is run, and -[DVC-files](/doc/user-guide/dvc-file-format) (or stage files) are created in -it. +[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it. + +Note that [external outputs](/doc/user-guide/external-outputs) also form part +of your expanded workspace, technically. ` }, { diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 48835b17f4..2ac5599251 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -27,21 +27,24 @@ Under the hood, a few actions are taken for each file in `targets`: 3. Replace the file by a link to the file in the cache (see details below). 4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store the MD5 checksum to identify the cached file. -5. Add the targets to `.gitignore` (if Git is used in this +5. 
Add the `targets` to `.gitignore` (if Git is used in this workspace) to prevent it from being committed to the Git repository. 6. Instructions are printed showing `git` commands for adding the files to a Git repository. If a different SCM system is being used, use the equivalent - command for that system or nothing is printed if `--no-scm` was specified for - the repository. + command for that system. Nothing is printed if `--no-scm` was specified when + [initializing](/doc/commands-reference/init) the project. -Unless the `-f` options is used, by default the DVC-file name generated is -`.dvc`, where `` is file name of the first output (from `targets`). +Note that `targets` outside the current workspace are supported, creating +[external outputs](/doc/user-guide/external-outputs). + +Unless the `-f` options is used, the DVC-file name generated is `.dvc` by +default, where `` is file name of the first output (from `targets`). The result is data file is placed in the cache directory, and DVC-files can be -tracked via Git or other version control system. The DVC-file lists the added -file as an output (`out`), and references the cached file using the checksum. -See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details. +tracked via SCM. The DVC-file lists the added file as an output (`outs` field), +and references the cached file using the checksum. See +[DVC-File Format](/doc/user-guide/dvc-file-format) for more details. > Note that DVC-files created by this command are _orphans_: they have no > dependencies. 
_Orphan_ "stage files" are always considered _changed_ by diff --git a/static/docs/user-guide/external-dependencies.md b/static/docs/user-guide/external-dependencies.md index 5d3f392f6e..387cab9ccc 100644 --- a/static/docs/user-guide/external-dependencies.md +++ b/static/docs/user-guide/external-dependencies.md @@ -1,11 +1,12 @@ # External Dependencies -There are cases when data is large enough or processing is organized in a way -that you would like to avoid moving data out of the remote storage. For example, -you are processing data on HDFS, running Dask via SSH, or have a script that -streams data from S3 to process it, etc. A mechanism of external dependencies -and [External Outputs](/doc/user-guide/external-outputs) provides a way for DVC -to control data externally. +There are cases when data is so large, or its processing is organized in a way +that you would like to avoid moving it out of its external/remote location. For +example from a network attached storage (NAS) drive, processing data on HDFS, +running [Dask](https://dask.org/) via SSH, or having a script that streams data +from S3 to process it. A mechanism for external dependencies and +[external outputs](/doc/user-guide/external-outputs) provides a way for DVC to +control data externally. ## Description diff --git a/static/docs/user-guide/external-outputs.md b/static/docs/user-guide/external-outputs.md index 92cfe38114..3463db1e09 100644 --- a/static/docs/user-guide/external-outputs.md +++ b/static/docs/user-guide/external-outputs.md @@ -1,10 +1,11 @@ # Managing External Data -There are cases when data is large enough or processing is organized in a way -that you would like to avoid moving data out of the remote storage. For example, -you are processing data on HDFS, running Dask via SSH, or have a script that -streams data from S3 to process it, etc. 
A mechanism of external outputs and -[External Dependencies](/doc/user-guide/external-dependencies) provides a way +There are cases when data is so large, or its processing is organized in a way +that you would like to avoid moving it out of its external/remote location. For +example from a network attached storage (NAS) drive, processing data on HDFS, +running [Dask](https://dask.org/) via SSH, or having a script that streams data +from S3 to process it. A mechanism for external outputs and +[external dependencies](/doc/user-guide/external-dependencies) provides a way for DVC to control data externally. ## Description From 3f7884bdd3b32ba0d9307469ddf0249b71cd1356 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 4 Sep 2019 20:00:59 -0500 Subject: [PATCH 17/26] term: stop using glossary entry "cache directory", related updates per https://github.com/iterative/dvc.org/pull/601#pullrequestreview-282712563 --- src/Documentation/glossary.js | 2 +- static/docs/commands-reference/add.md | 37 ++++++------- static/docs/commands-reference/cache/dir.md | 2 +- static/docs/commands-reference/cache/index.md | 2 +- static/docs/commands-reference/checkout.md | 12 ++-- static/docs/commands-reference/commit.md | 49 ++++++++--------- static/docs/commands-reference/fetch.md | 55 ++++++++++--------- static/docs/commands-reference/init.md | 7 ++- .../docs/commands-reference/metrics/index.md | 2 +- static/docs/commands-reference/pull.md | 20 +++---- static/docs/commands-reference/push.md | 35 ++++++------ .../docs/commands-reference/remote/modify.md | 15 ++--- static/docs/commands-reference/status.md | 11 ++-- static/docs/get-started/initialize.md | 2 +- static/docs/tutorial/define-ml-pipeline.md | 10 ++-- static/docs/tutorial/preparation.md | 2 +- static/docs/tutorial/sharing-data.md | 2 +- .../docs/understanding-dvc/core-features.md | 2 +- static/docs/understanding-dvc/how-it-works.md | 8 +-- .../understanding-dvc/related-technologies.md | 2 +- 
static/docs/understanding-dvc/what-is-dvc.md | 6 +- .../user-guide/dvc-files-and-directories.md | 6 +- .../user-guide/large-dataset-optimization.md | 14 ++--- 23 files changed, 152 insertions(+), 151 deletions(-) diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js index 4ab94d21cf..7a4f946619 100644 --- a/src/Documentation/glossary.js +++ b/src/Documentation/glossary.js @@ -29,7 +29,7 @@ Initialized by running \`dvc init\` in the **workspace**. It will contain the }, { name: 'DVC Cache', - match: ['DVC cache', 'cache', 'cache directory', 'cached'], + match: ['DVC cache', 'cache', 'cached'], desc: ` The DVC cache is a hidden storage (by default located in the \`.dvc/cache\` directory) for files that are under DVC control, and their different versions. diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index 2ac5599251..e8f5f582a3 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -1,7 +1,7 @@ # add Take a data file or a directory under DVC control (by creating a corresponding -DVC-file). +[DVC-file](/doc/user-guide/dvc-file-format)). ## Synopsis @@ -20,32 +20,31 @@ file is committed to the cache. Using the `--no-commit` option, the file will not be added to the cache and instead the `dvc commit` command is used when (or if) the file is to be committed to the cache. -Under the hood, a few actions are taken for each file in `targets`: +Under the hood, a few actions are taken for each file (or directory) in +`targets`: 1. Calculate the file checksum. -2. Move the file content to the cache directory (by default in `.dvc/cache`). -3. Replace the file by a link to the file in the cache (see details below). +2. Move the file contents to the cache directory (by default in `.dvc/cache`), + using the checksum to form the cached file name. +3. Replace the file by a link to the file in cache (see details below). 4. 
Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store - the MD5 checksum to identify the cached file. -5. Add the `targets` to `.gitignore` (if Git is used in this - workspace) to prevent it from being committed to the Git - repository. + the checksum to identify the cached file. Unless the `-f` options is used, + the DVC-file name generated is `.dvc` by default, where `` is + file name of the first output (from `targets`). +5. Add the `targets` in the workspace to `.gitignore` to prevent it + from being committed to the Git repository, unless `--no-scm` was used when + [initializing](/doc/commands-reference/init) this project. 6. Instructions are printed showing `git` commands for adding the files to a Git - repository. If a different SCM system is being used, use the equivalent - command for that system. Nothing is printed if `--no-scm` was specified when - [initializing](/doc/commands-reference/init) the project. + repository, unless `--no-scm` was used. + +The result is that the target data gets cached by DVC, and instead small +DVC-files can be tracked with Git. The DVC-file lists the added file as an +output (`outs` field), and references the cached file using the checksum. See +[DVC-File Format](/doc/user-guide/dvc-file-format) for more details. Note that `targets` outside the current workspace are supported, creating [external outputs](/doc/user-guide/external-outputs). -Unless the `-f` options is used, the DVC-file name generated is `.dvc` by -default, where `` is file name of the first output (from `targets`). - -The result is data file is placed in the cache directory, and DVC-files can be -tracked via SCM. The DVC-file lists the added file as an output (`outs` field), -and references the cached file using the checksum. See -[DVC-File Format](/doc/user-guide/dvc-file-format) for more details. - > Note that DVC-files created by this command are _orphans_: they have no > dependencies. 
_Orphan_ "stage files" are always considered _changed_ by > `dvc repro`, which always executes them. diff --git a/static/docs/commands-reference/cache/dir.md b/static/docs/commands-reference/cache/dir.md index 846ff2c390..b06cc577d6 100644 --- a/static/docs/commands-reference/cache/dir.md +++ b/static/docs/commands-reference/cache/dir.md @@ -1,6 +1,6 @@ # cache dir -Set/unset the cache directory location intuitively (compared to +Set/unset the cache directory location intuitively (compared to using `dvc config cache`). ## Synopsis diff --git a/static/docs/commands-reference/cache/index.md b/static/docs/commands-reference/cache/index.md index e404eee547..46e030c10b 100644 --- a/static/docs/commands-reference/cache/index.md +++ b/static/docs/commands-reference/cache/index.md @@ -1,6 +1,6 @@ # cache -Contains a helper command to set the cache directory location: +Contains a helper command to set the cache directory location: [dir](/doc/commands-reference/cache/dir). ## Synopsis diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 04719c7400..33e3f275ab 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -66,7 +66,7 @@ restoring any file size will be almost instantaneous. The output of `dvc checkout` does not list which data files were restored. It does report removed files and files that DVC was unable to restore because -they're missing from the cache. +they're missing from the cache. This command will fail to checkout files that are missing from the cache. In such a case, `dvc checkout` prints a warning message. Any files that can be @@ -92,8 +92,8 @@ be pulled from remote storage using `dvc pull`. - `-f`, `--force` - does not prompt when removing workspace files. Changing the current set of DVC-files with `git checkout` can result in the need for DVC to - remove files that don't match those DVC-file references or are missing in the - cache directory. 
(They are not "committed", in DVC terms.) + remove files that don't match those DVC-file references or are missing from + cache. (They are not "committed", in DVC terms.) - `-h`, `--help` - shows the help message and exit. @@ -206,9 +206,9 @@ MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43 What happened is that DVC went through the sole existing DVC-file and adjusted the current set of files to match the `outs` of that stage. `dvc fetch` runs -once to download missing data from the remote storage to the cache -directory. Alternatively, we could have just run `dvc pull` in this case -to automatically do `dvc fetch` + `dvc checkout`. +once to download missing data from the remote storage to the cache. +Alternatively, we could have just run `dvc pull` in this case to automatically +do `dvc fetch` + `dvc checkout`. ## Automating `dvc checkout` diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md index 5980335899..4d8d35d42b 100644 --- a/static/docs/commands-reference/commit.md +++ b/static/docs/commands-reference/commit.md @@ -1,8 +1,8 @@ # commit Record changes to the repository by updating -[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to cache -directory. +[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to the +cache. ## Synopsis @@ -20,9 +20,8 @@ positional arguments: The `dvc commit` command is useful for several scenarios where a dataset is being changed: when a [stage](/doc/commands-reference/run) or [pipeline](/doc/commands-reference/pipeline) is in development, when one wishes -to run commands outside the control of DVC, or to force -[DVC-file](/doc/user-guide/dvc-file-format) updates to save time tying stages or -a pipeline. +to run commands outside the control of DVC, or to force DVC-file updates to save +time tying stages or a pipeline. - Code or data for a stage is under active development, with rapid iteration of code, configuration, or data. 
Run DVC commands (`dvc run`, `dvc repro`, and @@ -43,29 +42,29 @@ a pipeline. stages. `dvc commit` can help avoid having to reproduce a pipeline in these cases by forcing the update of the DVC-files. -The last two use cases are **not recommended**, and essentially force update the -DVC-files and save data to cache. They are still useful, but keep in mind that -DVC can't guarantee reproducibility in those cases – You commit any data you -want. Let's take a look at what is happening in the fist scenario closely: +Let's take a look at what is happening in the fist scenario closely. Normally +DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the +cache after creating a DVC-file. What _commit_ means is that DVC: -Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data -to the cache after creating a DVC-file. What _commit_ means is that -DVC: - -- Computes a checksum for the file/directory -- Enters the checksum and file name into the DVC-file -- Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`) - (Note that if the workspace was initialized with no SCM support +- Computes a checksum for the file/directory. +- Enters the checksum and file name into the DVC-file. +- Tells Git to ignore the file/directory (adding an entry to `.gitignore`). + (Note that if the project was initialized with no SCM support (`dvc init --no-scm`), this does not happen.) -- Adds the file/directory or to the cache directory +- Adds the file/directory or to the cache. There are many cases where the last step is not desirable (for example rapid iterations on an experiment). The `--no-commit` option prevents the last step from occurring (on the commands where it's available), saving time and space by not storing unwanted data artifacts. Checksums is still computed -and added to the DVC-file, but the actual data file is not saved in the DVC -cache. This is where the `dvc commit` command comes into play. 
It performs that -last step: storing the file in the cache directory. +and added to the DVC-file, but the actual data file is not saved in the cache. +This is where the `dvc commit` command comes into play. It performs that last +step (saving the data in cache). + +The last two scenarios are **not recommended**. They essentially force-update +the [DVC-files](/doc/user-guide/dvc-file-format) and save data to cache. They +are still useful, but keep in mind that DVC can't guarantee reproducibility in +those cases – where you commit any data you want. ## Options @@ -135,7 +134,7 @@ the cache with undesired intermediate results, we can run a single stage with `dvc run --no-commit`, or reproduce an entire pipeline using `dvc repro --no-commit`. This prevents data from being pushed to cache. When development of the stage is finished, `dvc commit` can be used to store data -files in the cache directory. +files in the cache. In the `featurize.dvc` stage, `src/featurize.py` is executed. A useful change to make is adjusting a parameter to `CountVectorizer` in that script. Namely, @@ -157,7 +156,7 @@ $ dvc repro --no-commit evaluate.dvc We can run this command as many times as we like, editing `featurize.py` any way we like, and so long as we use `--no-commit`, the data does not get saved to the -cache directory. But it is instructive to verify that's the case: +cache. Let's verify that's the case: First verification: @@ -196,8 +195,8 @@ wdir: . To verify this instance of `model.pkl` is not in the cache, we must know the path to the cached file. In the cache directory, the first two characters of the checksum are used as a subdirectory name, and the remaining characters are the -file name. Therefore, if the file had been committed to the cache it would -appear in the directory `.dvc/cache/70`. Let's check: +file name. Therefore, had the file been committed to the cache, it would appear +in the directory `.dvc/cache/70`. 
Let's check: ```dvc $ ls .dvc/cache/70 diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md index e0cab2f6a3..3e17ae0f3a 100644 --- a/static/docs/commands-reference/fetch.md +++ b/static/docs/commands-reference/fetch.md @@ -2,7 +2,7 @@ Get files that are under DVC control from [remote](/doc/commands-reference/remote#description) storage into the -cache directory. +cache. ## Synopsis @@ -19,10 +19,11 @@ positional arguments: ## Description The `dvc fetch` command is a means to download files from remote storage into -the cache directory, but without placing them in the workspace. -This makes the data files available for linking (or copying) into the workspace. -(Refer to [dvc config cache.type](/doc/commands-reference/config#cache).) Along -with `dvc checkout`, it's performed automatically by `dvc pull` when the target +the cache of the project, but without placing them in the +workspace. This makes the data files available for linking (or +copying) into the workspace. (Refer to +[dvc config cache.type](/doc/commands-reference/config#cache).) Along with +`dvc checkout`, it's performed automatically by `dvc pull` when the target [DVC-files](/doc/user-guide/dvc-file-format) are not already in the cache: ``` @@ -34,7 +35,7 @@ remote storage | +------------+ | - - - - | dvc fetch | ++ v +------------+ + +----------+ -cache directory ++ | dvc pull | +project's cache ++ | dvc pull | + +------------+ + +----------+ | - - - - |dvc checkout| ++ | +------------+ @@ -42,22 +43,21 @@ cache directory ++ | dvc pull | workspace ``` -Fetching could be useful when first checking out an existing DVC -project, since files under DVC control could already exist in remote -storage, but won't be in the project's cache. (Refer to `dvc remote` for more -information on DVC remotes.) 
These necessary data or model files are listed as -dependencies or outputs in a DVC-file (target -[stage](/doc/commands-reference/run)) so they are required to -[reproduce](/doc/get-started/reproduce) the corresponding +Fetching could be useful when first checking out a DVC project, +since files under DVC control should already exist in remote storage, but won't +be in the project's cache. (Refer to `dvc remote` for more information on DVC +remotes.) These necessary data or model files are listed as dependencies or +outputs in a DVC-file (target [stage](/doc/commands-reference/run)) so they are +required to [reproduce](/doc/get-started/reproduce) the corresponding [pipeline](/doc/commands-reference/pipeline). (See [DVC-File Format](/doc/user-guide/dvc-file-format) for more information on dependencies and outputs.) `dvc fetch` ensures that the files needed for a DVC-file to be -[reproduced](/doc/get-started/reproduce) exist in the cache directory. If no -`targets` are specified, the set of data files to fetch is determined by -analyzing all DVC-files in the current branch, unless `--all-branches` or -`--all-tags` is specified. +[reproduced](/doc/get-started/reproduce) exist in cache. If no `targets` are +specified, the set of data files to fetch is determined by analyzing all +DVC-files in the current branch, unless `--all-branches` or `--all-tags` is +specified. The default remote is used unless `--remote` is specified. See `dvc remote add` for more information on how to configure different remote storage providers. @@ -191,7 +191,7 @@ $ tree .dvc ├── ... ``` -> `dvc status --cloud` (or `-c`) compares the cache directory vs. the default +> `dvc status --cloud` (or `-c`) compares the cache contents vs. the default > remote. 
As seen above, used without arguments, `dvc fetch` downloads all assets needed @@ -287,14 +287,15 @@ $ tree .dvc/cache └── a9c512fda11293cfee7617b66648dc ``` -Fetching using `--with-deps` starts with the target DVC-file (stage) and -searches backwards through its pipeline for data files to download into the -cache directory. All the data for the second and third stages ("featurize" and -"train") has now been downloaded to cache. We could now use `dvc checkout` to -get the data files needed to reproduce this pipeline up to the third stage into -the workspace (with `dvc repro train.dvc`). +Fetching using `--with-deps` starts with the target +[DVC-file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches +backwards through its pipeline for data to download into the project's cache. +All the data for the second and third stages ("featurize" and "train") has now +been downloaded to the cache. We could now use `dvc checkout` to get the data +files needed to reproduce this pipeline up to the third stage into the workspace +(with `dvc repro train.dvc`). > Note that in this sample project, the last stage file `evaluate.dvc` doesn't -> add any more data files than those form previous stages so at this point all -> of the files for this pipeline are in the project's cache and `dvc status -c` -> would output `Data and pipelines are up to date.` +> add any more data files than those from previous stages.
So at this point +> (after reproducing `train.dvc`) all of the data for this pipeline is cached, +> and `dvc status -c` would output `Data and pipelines are up to date.` diff --git a/static/docs/commands-reference/init.md b/static/docs/commands-reference/init.md index 5fea5a5986..471f5bf2e2 100644 --- a/static/docs/commands-reference/init.md +++ b/static/docs/commands-reference/init.md @@ -14,15 +14,16 @@ usage: dvc init [-h] [-q | -v] [--no-scm] [-f] ## Description After DVC initialization, a new directory `.dvc/` will be created with `config` -and `.gitignore` files, and cache directory. These files and +and `.gitignore` files, and cache directory. These files and directories are hidden from the user generally and are not meant to be manipulated directly. `.dvc/cache` is one of the most important [DVC directories](/doc/user-guide/dvc-files-and-directories). It will hold all the contents of tracked data files. Note that `.dvc/.gitignore` lists this -directory, which means that the cache directory is not under Git control. This -is a local cache and you cannot `git push` it. +directory, which means that the +[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) +is not under Git control. This is a local cache and you cannot `git push` it. 
## Options diff --git a/static/docs/commands-reference/metrics/index.md b/static/docs/commands-reference/metrics/index.md index 6089498212..cbde21151d 100644 --- a/static/docs/commands-reference/metrics/index.md +++ b/static/docs/commands-reference/metrics/index.md @@ -78,7 +78,7 @@ $ dvc metrics show data/eval.json: 0.624652 ``` -And finally let's remove `data/eval.json` from project's metrics: +And finally let's remove `data/eval.json` from the project's metrics: ```dvc $ dvc metrics remove data/eval.json diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md index 16e3017927..3ea32a1a16 100644 --- a/static/docs/commands-reference/pull.md +++ b/static/docs/commands-reference/pull.md @@ -1,9 +1,9 @@ # pull Downloads missing files and directories from -[remote storage](/doc/commands-reference/remote) to the cache -directory based on [DVC-files](/doc/user-guide/dvc-file-format) in the -workspace, then links the downloaded files into the workspace. +[remote storage](/doc/commands-reference/remote) to the cache based +on [DVC-files](/doc/user-guide/dvc-file-format) in the workspace, +then links the downloaded files into the workspace. ## Synopsis @@ -43,9 +43,9 @@ only the files (or directories) missing from the workspace by searching all versions or branches of the repository if using Git, nor will it download files which have not changed. -The command `dvc status -c` can list files that are missing in the project's -cache, but referenced in its current DVC-files. It can be used to see what files -`dvc pull` would download. +The command `dvc status -c` can list files referenced in current DVC-files, but +missing in the cache. It can be used to see what files `dvc pull` +would download. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. 
Using the `--with-deps` option, DVC tracks dependencies @@ -108,9 +108,9 @@ done and set a context for the example, let's define an SSH remote with the `dvc remote add` command: ```dvc -$ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/cache/directory +$ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/remote/storage $ dvc remote list -r1 ssh://_username_@_host_/path/to/dvc/cache/directory +r1 ssh://_username_@_host_/path/to/dvc/remote/storage ``` > DVC supports several remote types. For details, see the @@ -158,8 +158,8 @@ model.p.dvc Dvcfile ``` -Imagine the remote storage has been modified such that the data files in some of -these stages should be updated into the cache directory. +Imagine the remote storage has been modified such that the data in some of these +stages should be updated in the workspace. ```dvc $ dvc status --cloud diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index a2266b6048..32d08a78f1 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -31,15 +31,14 @@ save any changes in the code or DVC-files. Those should be saved by using Under the hood a few actions are taken: - The push command by default uses all - [DVC-files](/doc/user-guide/dvc-file-format in the current version. The + [DVC-files](/doc/user-guide/dvc-file-format) in the workspace. The command-line options listed below will either limit or expand the set of DVC-files to consult. -- For each output referenced from each selected DVC-files, it finds a - corresponding entry in the cache directory. DVC checks if the - entry exists, or not, in the remote simply by looking for it using the - checksum. From this DVC gathers a list of files missing from the remote - storage. +- For each output referenced from each selected DVC-file, DVC finds a + corresponding entry in the cache directory. DVC checks whether + the entry exists in the remote.
From this DVC gathers a list of files missing + from the remote storage. - Upload the cache files missing from remote storage, if any, to the remote. @@ -57,8 +56,8 @@ of the project directory, nor will it upload files which have not changed. The `dvc status -c` command can list files tracked by DVC that are new in the -cache directory (compared to the default remote.) It can be used to see what -files `dvc push` would upload. +cache (compared to the default remote.) It can be used to see what files +`dvc push` would upload. If one or more `targets` are specified, DVC only considers the files associated with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies @@ -158,8 +157,8 @@ model.p.dvc Dvcfile ``` -Imagine the project's cache has been modified such that the data files in some -of these stages should be uploaded to remote storage. +Imagine the project has been modified such that the output of some of these +stages should be uploaded to remote storage. ```dvc $ dvc status --cloud @@ -206,12 +205,12 @@ double check that all data had been uploaded. ## Example: What happens in the cache -Let's take a detailed look at what happens to the cache directory +Let's take a detailed look at what happens to the cache directory as you run an experiment locally and push data to remote storage. To set the example consider having created a workspace that contains some code and data, and having set up a remote. -Some work has been performed in the local workspace, and it contains new data to +Some work has been performed in the workspace, and it contains new data to upload to the shared remote. When running `dvc status --cloud` the report will list several files in `new` state. By looking in the cached directories we can see exactly what that means. @@ -260,10 +259,10 @@ The directory `.dvc/cache` is the local cache, while `../vault/recursive` is the remote storage – a "local remote" in this case. 
This listing shows the cache having more files in it than the remote does (which is what `new` means). -Next we can upload part of the data from the cache directory to a remote using -the command `dvc push --with-deps STAGE.dvc`. Remember that `--with-deps` -searches backwards from the DVC-file `targets` to locate files to upload, and -does not upload files in subsequent stages. +Next we can upload part of the data from the cache to the remote using the +command `dvc push --with-deps .dvc`. Remember that `--with-deps` searches +backwards from the DVC-file `targets` to locate files to upload, and does not +upload files in subsequent stages. After doing that we can inspect the remote storage again: @@ -292,8 +291,8 @@ $ tree ../vault/recursive The remote storage now has some of the files which had been missing, but not all of them. Indeed `dvc status --cloud` still lists a couple files as `new`. We can -clearly see this in that a couple files are in the cache directory and not in -the remote. +clearly see this above, since a couple files are in the cache, but not in the +remote. After running `dvc push` to cause all files to be uploaded, the remote storage now contains all of them: diff --git a/static/docs/commands-reference/remote/modify.md b/static/docs/commands-reference/remote/modify.md index 2df6ea35b9..eaeaf67b3a 100644 --- a/static/docs/commands-reference/remote/modify.md +++ b/static/docs/commands-reference/remote/modify.md @@ -30,7 +30,7 @@ Remote `name` and `option` name are required. Option names are remote type specific. See below examples and a list of per remote type: AWS S3, Google Cloud, Azure, SSH, ALiyun OSS, and others. -This command modifies a `remote` section in the DVC project's +This command modifies a `remote` section in the project's [config file](/doc/commands-reference/config). Alternatively, `dvc config` or manual editing could be used to change the configuration. 
@@ -122,7 +122,7 @@ these settings, you could use the following options: ``` - `acl` - set object level access control list (ACL) such as `private`, -`public-read`, etc. By default, no ACL is specified. + `public-read`, etc. By default, no ACL is specified. ```dvc $ dvc remote modify myremote acl bucket-owner-full-control @@ -263,13 +263,14 @@ For more information on configuring Azure Storage connection strings, visit ```dvc $ dvc remote modify myremote ask_password true ``` - -- `gss_auth` - use Generic Security Services authentication if available on - host (for example, [with kerberos](https://en.wikipedia.org/wiki/Generic_Security_Services_Application_Program_Interface#Relationship_to_Kerberos)). + +- `gss_auth` - use Generic Security Services authentication if available on host + (for example, + [with kerberos](https://en.wikipedia.org/wiki/Generic_Security_Services_Application_Program_Interface#Relationship_to_Kerberos)). Using this option requires `paramiko[gssapi]` which is currently only supported by our pip package and could be installed with - `pip install 'dvc[ssh_gssapi]'`. Other packages (Conda, Windows, Homebrew - cask and Mac pkg) do not support it. + `pip install 'dvc[ssh_gssapi]'`. Other packages (Conda, Windows, Homebrew cask + and Mac pkg) do not support it. ```dvc $ dvc remote modify myremote gss_auth true diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index 984cc50819..c715cced35 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -2,8 +2,8 @@ Show changes in the project [pipelines](/doc/commands-reference/pipeline), as well as mismatches either -between the cache directory and workspace files, or -between the cache and remote storage. +between the cache and workspace files, or between the +cache and remote storage. ## Synopsis @@ -85,8 +85,8 @@ outputs described in it. 
**For comparison against remote storage:** -- _new_ means that the file/directory exists in the cache directory but not in - remote storage. +- _new_ means that the file/directory exists in the cache but not in remote + storage. - _deleted_ means that the file/directory doesn't exist in the cache, but exists in remote storage. @@ -192,7 +192,6 @@ remote yet: ```dvc $ dvc status --remote storage - Preparing to collect status from s3://dvc-remote [##############################] 100% Collecting information new: data/model.p @@ -202,4 +201,4 @@ Preparing to collect status from s3://dvc-remote ``` The output shows where the location of the remote storage is, as well as any -differences between the cache directory and remote. +differences between the cache and `storage` remote. diff --git a/static/docs/get-started/initialize.md b/static/docs/get-started/initialize.md index ab36dfee7f..55bd42c97a 100644 --- a/static/docs/get-started/initialize.md +++ b/static/docs/get-started/initialize.md @@ -23,7 +23,7 @@ $ git commit -m "Initialize DVC project" ``` After DVC initialization, a new directory `.dvc/` will be created with `config` -and `.gitignore` files, and cache directory. These files and +and `.gitignore` files, and cache directory. These files and directories are hidden from the user generally and are not meant to be manipulated directly. diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index 0a47e8022c..f86d057ff5 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -66,7 +66,7 @@ If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by `dvc add`, you will see that only outputs are defined in `outs`. In this file, only one output is defined. The output contains the data file path in the repository and md5 checksum. This checksum determines a location of the actual -content file in the cache directory, `.dvc/cache`. 
+content file in the cache directory, `.dvc/cache`. ```dvc $ cat data/Posts.xml.zip.dvc @@ -81,10 +81,10 @@ $ du -sh .dvc/cache/ec/* ``` > Outputs from DVC-files define the relationship between the data file path in a -> repository and the path in a cache directory. +> repository and the path in the cache directory. -Keeping actual file content in a cache directory and a copy of the caches in the -user workspace during `$ git checkout` is a regular trick that +Keeping actual file contents in the cache, and a copy of the cached file in the +workspace during `$ git checkout` is a regular trick that [Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses. This trick works fine for tracking small files with source code. For large data files, this might not be the best approach, because of _checkout_ operation for @@ -191,7 +191,7 @@ and does some additional work if the command was successful: 1. DVC transforms all the outputs `-o` files into data files. It is like applying `dvc add` for each of the outputs. As a result, all the actual data - files content goes to the cache directory `.dvc/cache` and each + files content goes to the cache directory `.dvc/cache` and each of the file names will be added to `.gitignore`. 2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage diff --git a/static/docs/tutorial/preparation.md b/static/docs/tutorial/preparation.md index 8f0c9f5711..394c763448 100644 --- a/static/docs/tutorial/preparation.md +++ b/static/docs/tutorial/preparation.md @@ -68,7 +68,7 @@ DVC works on top of Git repositories. You run DVC initialization in a repository directory to create DVC meta files and directories. After DVC initialization, a new directory `.dvc/` will be created with `config` -and `.gitignore` files, and cache directory. These files and +and `.gitignore` files, and cache directory. These files and directories are hidden from the user generally and are not meant to be manipulated directly. 
However, we describe some DVC internals below for a better understanding of how it works. diff --git a/static/docs/tutorial/sharing-data.md b/static/docs/tutorial/sharing-data.md index 237321f2e7..1f8f72670f 100644 --- a/static/docs/tutorial/sharing-data.md +++ b/static/docs/tutorial/sharing-data.md @@ -28,7 +28,7 @@ $ git status -s M .dvc/config ``` -Then, a simple command pushes files from your cache directory to the cloud: +Then, a simple command pushes files from your cache to the cloud: ```dvc $ dvc push diff --git a/static/docs/understanding-dvc/core-features.md b/static/docs/understanding-dvc/core-features.md index 9a730c1ae1..79dbeb1e54 100644 --- a/static/docs/understanding-dvc/core-features.md +++ b/static/docs/understanding-dvc/core-features.md @@ -8,7 +8,7 @@ graphs. 3. **Large data file versioning** works by creating pointers in your Git - repository to the cache directory on a local hard drive. + repository to the cache, typically stored on a local hard drive. 4. **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML library agnostic: Keras, Tensorflow, PyTorch, scipy, etc. diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md index 487afa48c8..039dee4fca 100644 --- a/static/docs/understanding-dvc/how-it-works.md +++ b/static/docs/understanding-dvc/how-it-works.md @@ -48,7 +48,7 @@ ```dvc $ git checkout a03_normbatch_vgg16 # checkout code and DVC-files - $ dvc checkout # checkout data files from the cache directory + $ dvc checkout # checkout data files from the cache $ ls -l data/ # These LARGE files came from the cache, not from Git total 1017488 @@ -72,12 +72,12 @@ Rscript plot.R result.csv plots.jpg ``` -7. A DVC project's cache can be shared with your colleagues and partners through - AWS S3, Azure Blob Storage or GCP Storage: +7. 
The cache of a DVC project can be shared with your colleagues and partners + through AWS S3, Azure Blob Storage, GCP Storage, among others: ```dvc $ git push - $ dvc push # push from the cache directory to remote storage + $ dvc push # push from the cache to remote storage # On a colleague machine: $ git clone https://github.com/dataversioncontrol/myrepo.git diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md index fa3241db3e..75d268796b 100644 --- a/static/docs/understanding-dvc/related-technologies.md +++ b/static/docs/understanding-dvc/related-technologies.md @@ -38,7 +38,7 @@ process. - DVC has transparent design. Its [internal files and directories](/doc/user-guide/dvc-files-and-directories) - (including the cache directory) have a human-readable format and + (including the cache directory) have a human-readable format and can be easily reused by external tools. 4. **Git workflows** and Git usage methodologies such as Gitflow. The diff --git a/static/docs/understanding-dvc/what-is-dvc.md b/static/docs/understanding-dvc/what-is-dvc.md index 9c08dae3ac..2cb6607401 100644 --- a/static/docs/understanding-dvc/what-is-dvc.md +++ b/static/docs/understanding-dvc/what-is-dvc.md @@ -22,8 +22,8 @@ DVC uses a few core concepts: features, change model hyperparameters, data cleaning, add a new data source) should be performed in a separate branch and then merged into the master branch only if the experiment is successful. DVC allows experiments to be - integrated into a project's history and NEVER needs to recompute the results - after a successful merge. + integrated into a Git repository history and NEVER needs to recompute the + results after a successful merge. - **Experiment state** or state: Equivalent to a Git snapshot (all committed files).
Git checksum, branch name, or tag can be used as a reference to a @@ -48,7 +48,7 @@ DVC uses a few core concepts: in Git for DVC needs (to maintain pipelines and reproducibility). - **Cache directory**: Directory with all data files on a local hard drive or in - cloud storage, but not in the Git repository. + cloud storage, but not in the Git repository. See `dvc cache dir`. - **Cloud storage** support: available complement to the core DVC features. This is how a data scientist transfers large data files or shares a GPU-trained diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md index 41417fc02e..10944dc336 100644 --- a/static/docs/user-guide/dvc-files-and-directories.md +++ b/static/docs/user-guide/dvc-files-and-directories.md @@ -45,8 +45,8 @@ directory (`.dvc/`) with special internal files and directories: ## Structure of cache directory There are two ways in which the data is stored in cache. It depends -on if the actual data is stored in a file (eg. `data.csv`) or it is a directory -of files. +on whether the actual data is stored in a single file (eg. `data.csv`) or in a +directory of files. We evaluate a checksum, usually MD5, for the data file which is a 32 characters long string. The first two characters are assigned to name the directory inside @@ -108,3 +108,5 @@ $ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir {"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"} ] ``` + +See also `dvc cache dir` to set the location of the cache directory. 
diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md index db0f278d8c..97aa6f5bbc 100644 --- a/static/docs/user-guide/large-dataset-optimization.md +++ b/static/docs/user-guide/large-dataset-optimization.md @@ -1,8 +1,8 @@ # Large Dataset Optimization In order to track the data files and directories added with `dvc add` or -`dvc run`, DVC moves all these files to a special cache directory. -A DVC project's cache is the hidden storage (by default located in +`dvc run`, DVC moves all these files to a special cache. A +DVC project's cache is the hidden storage (by default located in `.dvc/cache`) for files that are under DVC control, and their different versions. (See `dvc cache` and [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more @@ -38,11 +38,11 @@ Symbolic links, and Reflinks in more recent systems. While reflinks bring all the benefits and none of the worries, they're not commonly supported in most platforms yet. Hard/soft links optimize **speed** and **space** in the file system, but may break your workflow since updating hard/sym-linked files tracked -by DVC in the workspace causes cache corruption. These 2 link types thus require -using cache **protected mode** (see the `cache.protected` config option in -`dvc config cache`). Finally, a 4th "linking" option is to actually copy files -from/to the cache, which is safe but inefficient, especially for large files -(several GBs or more data). +by DVC in the workspace causes cache corruption. These 2 link types +thus require using cache **protected mode** (see the `cache.protected` config +option in `dvc config cache`). Finally, a 4th "linking" option is to actually +copy files from/to the cache, which is safe but inefficient, especially for +large files (several GBs or more data). > Some versions of Windows (e.g. 
Windows Server 2012+ and Windows 10 Enterprise) > support hard or soft links on the From 3c0db9f20f23766afbd3570a337165362dc95451 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 01:11:08 -0500 Subject: [PATCH 18/26] user-guide: link "cache directory" term where appropriate --- static/docs/commands-reference/cache/dir.md | 10 ++++++---- static/docs/commands-reference/init.md | 5 ++--- static/docs/commands-reference/push.md | 13 +++++++------ static/docs/tutorial/define-ml-pipeline.md | 12 +++++++----- 4 files changed, 22 insertions(+), 18 deletions(-) diff --git a/static/docs/commands-reference/cache/dir.md b/static/docs/commands-reference/cache/dir.md index b06cc577d6..bdb5ecc265 100644 --- a/static/docs/commands-reference/cache/dir.md +++ b/static/docs/commands-reference/cache/dir.md @@ -16,10 +16,12 @@ positional arguments: ## Description -Helper to set the `cache.dir` configuration option. Unlike doing so with -`dvc config cache`, this command transform paths (`value`) that are provided -relative to the current working directory into paths **relative to the config -file location**. They are required in the latter form for the config file. +Helper to set the `cache.dir` configuration option. (See +[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).) +Unlike doing so with `dvc config cache`, this command transforms paths (`value`) +that are provided relative to the current working directory into paths +**relative to the config file location**. They are required in the latter form +for the config file. ## Options diff --git a/static/docs/commands-reference/init.md b/static/docs/commands-reference/init.md index 471f5bf2e2..147ba53cac 100644 --- a/static/docs/commands-reference/init.md +++ b/static/docs/commands-reference/init.md @@ -21,9 +21,8 @@ manipulated directly. `.dvc/cache` is one of the most important [DVC directories](/doc/user-guide/dvc-files-and-directories).
It will hold all the contents of tracked data files. Note that `.dvc/.gitignore` lists this -directory, which means that the -[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) -is not under Git control. This is a local cache and you cannot `git push` it. +directory, which means that the cache directory is not under Git control. This +is a local cache and you cannot `git push` it. ## Options diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md index 32d08a78f1..97668e0981 100644 --- a/static/docs/commands-reference/push.md +++ b/static/docs/commands-reference/push.md @@ -36,9 +36,9 @@ Under the hood a few actions are taken: DVC-files to consult. - For each output referenced from each selected DVC-file, DVC finds a - corresponding entry in the cache directory. DVC checks whether - the entry exists in the remote. From this DVC gathers a list of files missing - from the remote storage. + corresponding entry in the cache. DVC checks whether the entry + exists in the remote. From this DVC gathers a list of files missing from the + remote storage. - Upload the cache files missing from remote storage, if any, to the remote. @@ -205,15 +205,16 @@ double check that all data had been uploaded. ## Example: What happens in the cache -Let's take a detailed look at what happens to the cache directory +Let's take a detailed look at what happens to the +[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) as you run an experiment locally and push data to remote storage. To set the example consider having created a workspace that contains some code and data, and having set up a remote. Some work has been performed in the workspace, and it contains new data to upload to the shared remote. When running `dvc status --cloud` the report will -list several files in `new` state. By looking in the cached directories we can -see exactly what that means.
+list several files in `new` state. We can see exactly what that means by looking +in the project's cache: ```dvc $ tree .dvc/cache diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md index f86d057ff5..de7ba2fcaf 100644 --- a/static/docs/tutorial/define-ml-pipeline.md +++ b/static/docs/tutorial/define-ml-pipeline.md @@ -66,7 +66,9 @@ If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by `dvc add`, you will see that only outputs are defined in `outs`. In this file, only one output is defined. The output contains the data file path in the repository and md5 checksum. This checksum determines a location of the actual -content file in the cache directory, `.dvc/cache`. +content file in the +[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), +`.dvc/cache`. ```dvc $ cat data/Posts.xml.zip.dvc @@ -83,10 +85,10 @@ $ du -sh .dvc/cache/ec/* > Outputs from DVC-files define the relationship between the data file path in a > repository and the path in the cache directory. -Keeping actual file contents in the cache, and a copy of the cached file in the -workspace during `$ git checkout` is a regular trick that -[Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses. This -trick works fine for tracking small files with source code. For large data +Keeping actual file contents in the cache, and a copy of the cached +file in the workspace during `$ git checkout` is a regular trick +that [Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses. +This trick works fine for tracking small files with source code. For large data files, this might not be the best approach, because of _checkout_ operation for a 10Gb data file might take several seconds and a 50GB file checkout (think copy) might take a few minutes. 
From 5fe82e7e7d8f228101abb29d1120af3ff7def42d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 01:14:55 -0500 Subject: [PATCH 19/26] cmd ref: change from HEAD to "tip of default branch" in --rev option desc. per https://github.com/iterative/dvc.org/pull/601#discussion_r319747025 --- static/docs/commands-reference/get.md | 4 ++-- static/docs/commands-reference/import.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/static/docs/commands-reference/get.md b/static/docs/commands-reference/get.md index c85ef7d7d8..89562c246c 100644 --- a/static/docs/commands-reference/get.md +++ b/static/docs/commands-reference/get.md @@ -42,8 +42,8 @@ created in the current working directory, with its original file name. isn't used) is the current working directory (`.`) and original file name. - `--rev` - specific Git revision of the DVC repository to import the data from. - [`HEAD`](https://git-scm.com/book/en/v2/Git-Internals-Git-References#ref_the_ref) - is used by default when this option is not specified. + The tip of the default branch is used by default when this option is not + specified. - `-h`, `--help` - prints the usage/help message, and exit. diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index 0c2ec76709..965c371b52 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -62,8 +62,8 @@ downloaded data artifact from the external DVC repo. isn't used) is the current working directory (`.`) and original file name. - `--rev` - specific Git revision of the DVC repository to import the data from. - [`HEAD`](https://git-scm.com/book/en/v2/Git-Internals-Git-References#ref_the_ref) - is used by default when this option is not specified. + The tip of the default branch is used by default when this option is not + specified. - `-h`, `--help` - prints the usage/help message, and exit. 
From a5019efec9ea799512e7cbaa94da52b3049cd8dd Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 01:24:46 -0500 Subject: [PATCH 20/26] get-started: reword stage file commands explanation per https://github.com/iterative/dvc.org/pull/601#pullrequestreview-283328338 --- static/docs/get-started/reproduce.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/static/docs/get-started/reproduce.md b/static/docs/get-started/reproduce.md index 064a173f9d..4f18d78175 100644 --- a/static/docs/get-started/reproduce.md +++ b/static/docs/get-started/reproduce.md @@ -3,10 +3,9 @@ In the previous chapters, we described our first [pipeline]](/doc/commands-reference/pipeline). Basically, we generated a number of [stage files](/doc/commands-reference/run) -([DVC-files](/doc/user-guide/dvc-file-format)). Each of these stages define -single commands to execute towards a final result. Each depends on some data -(either raw data files or intermediate results from previous stages) and code -files. +([DVC-files](/doc/user-guide/dvc-file-format)). These stages define individual +commands to execute towards a final result. Each depends on some data (either +raw data files or intermediate results from previous stages) and code files. 
If you just cloned the [project](https://github.com/iterative/example-get-started), make sure you first From 46aa961e6b89144938a8d466fbf2808302366ddf Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 01:29:38 -0500 Subject: [PATCH 21/26] cmd ref: fix closing `)` in run and hyphenate "non-deterministic" in repro per https://github.com/iterative/dvc.org/pull/601#pullrequestreview-283328978 --- static/docs/commands-reference/repro.md | 4 ++-- static/docs/commands-reference/run.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 0e99a9921a..554ded7502 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -98,8 +98,8 @@ specified), and updates stage files with the new checksum information. stages following the changed stage, even if their direct dependencies did not change. Like with the same option on `dvc run`, this is a way to force execute stages without changes. This can also be useful for pipelines containing - stages that produce nondeterministic (semi-random) outputs. For - nondeterministic stages the outputs can vary on each execution, meaning the + stages that produce non-deterministic (semi-random) outputs. For + non-deterministic stages the outputs can vary on each execution, meaning the cache cannot be trusted for such stages. - `--downstream` - only execute the stages after the given `targets` in their diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md index a42a6f895d..4ef369323f 100644 --- a/static/docs/commands-reference/run.md +++ b/static/docs/commands-reference/run.md @@ -125,7 +125,7 @@ pipeline. for confirmation. 
- `--ignore-build-cache` - if an exactly equal DVC-file exists (same list of - outputs and inputs, the same command to run which has been already executed), + outputs and inputs, the same command to run) which has been already executed, and is up to date, `dvc run` won't normally execute the command again (thus "build cache"). This option gives a way to forcefully execute the command anyway. It's useful if the command is non-deterministic (meaning it produces From 8753d056c9c8ed2a482bf4f702d56b583f363e97 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 02:06:14 -0500 Subject: [PATCH 22/26] cmd ref: explain outputs better in `add` per https://github.com/iterative/dvc.org/pull/601#pullrequestreview-283899801 --- static/docs/commands-reference/add.md | 44 +++++++++++++++------------ 1 file changed, 24 insertions(+), 20 deletions(-) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index e8f5f582a3..bee7cdb2ba 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -15,10 +15,16 @@ positional arguments: ## Description -The `dvc add` command is analogous to the `git add` command. By default an added -file is committed to the cache. Using the `--no-commit` option, the -file will not be added to the cache and instead the `dvc commit` command is used -when (or if) the file is to be committed to the cache. +The `dvc add` command is analogous to the `git add` command. By default though, +an added file or directory is also committed to the cache. (Use the +`--no-commit` option to avoid this, and `dvc commit` to commit the data to cache +as a separate step.) + +This command's `targets` are files or directories to be places under DVC +control. These are turned into outputs (`outs` field) in a resulting +[DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.) 
+Note that target data outside the current workspace is supported, +which turn into [external outputs](/doc/user-guide/external-outputs). Under the hood, a few actions are taken for each file (or directory) in `targets`: @@ -27,27 +33,25 @@ Under the hood, a few actions are taken for each file (or directory) in 2. Move the file contents to the cache directory (by default in `.dvc/cache`), using the checksum to form the cached file name. 3. Replace the file by a link to the file in cache (see details below). -4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store - the checksum to identify the cached file. Unless the `-f` options is used, - the DVC-file name generated is `.dvc` by default, where `` is - file name of the first output (from `targets`). -5. Add the `targets` in the workspace to `.gitignore` to prevent it - from being committed to the Git repository, unless `--no-scm` was used when - [initializing](/doc/commands-reference/init) this project. -6. Instructions are printed showing `git` commands for adding the files to a Git - repository, unless `--no-scm` was used. +4. Create a corresponding DVC-file and store the checksum to identify the cached + file. Unless the `-f` option is used, the DVC-file name generated by default + is `.dvc`, where `` is the file name of the first target. +5. Unless `dvc init --no-scm` was used when initializing the project, add the + `targets` to `.gitignore` in order to prevent them from being committed to + the Git repository. +6. Unless `dvc init --no-scm` was used when initializing the project, + instructions are printed showing `git` commands for adding the files to a Git + repository. The result is that the target data gets cached by DVC, and instead small DVC-files can be tracked with Git. The DVC-file lists the added file as an output (`outs` field), and references the cached file using the checksum. See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details. 
-Note that `targets` outside the current workspace are supported, creating -[external outputs](/doc/user-guide/external-outputs). - -> Note that DVC-files created by this command are _orphans_: they have no -> dependencies. _Orphan_ "stage files" are always considered _changed_ by -> `dvc repro`, which always executes them. +> Note that DVC-files created by this command are considered _orphans_ because +> they have no dependencies, only outputs. These _orphan_ "stage files" are +> always treated as _changed_ by `dvc repro`, which always executes them. See +> `dvc run` to learn about regular stage files. By default DVC tries to use reflinks (see [File link types](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) @@ -227,7 +231,7 @@ top-level DVC-file is generated. But this is less convenient. With the `dvc add pics` a single DVC-file is generated, `pics.dvc`, which lets us treat the entire directory structure in one unit. It lets you pass the whole -directory tree as a dependency to a `dvc run` stage like so: +directory tree as a dependency to a `dvc run` stage definition, like this: ```dvc $ dvc run -f train.dvc \ From e4ce0246042114ab17983a6796a715dcfc892ae1 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 11:17:01 -0500 Subject: [PATCH 23/26] comlpemenet last commit --- static/docs/commands-reference/add.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md index bee7cdb2ba..cfb44d799a 100644 --- a/static/docs/commands-reference/add.md +++ b/static/docs/commands-reference/add.md @@ -17,14 +17,14 @@ positional arguments: The `dvc add` command is analogous to the `git add` command. By default though, an added file or directory is also committed to the cache. (Use the -`--no-commit` option to avoid this, and `dvc commit` to commit the data to cache -as a separate step.) 
+`--no-commit` option to avoid this, and `dvc commit` as a separate step when +ready.) -This command's `targets` are files or directories to be places under DVC -control. These are turned into outputs (`outs` field) in a resulting +The `targets` are files or directories to be places under DVC control. These are +turned into outputs (`outs` field) in a resulting [DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.) Note that target data outside the current workspace is supported, -which turn into [external outputs](/doc/user-guide/external-outputs). +which becomes [external outputs](/doc/user-guide/external-outputs). Under the hood, a few actions are taken for each file (or directory) in `targets`: From 62742ed96648f46fc0d328e5389888b68b3074c2 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 12:20:21 -0500 Subject: [PATCH 24/26] term: review DVC branding up to static/docs/commands-reference/metrics for #448 --- static/docs/commands-reference/get.md | 23 +++++++------- static/docs/commands-reference/import-url.md | 8 ++--- static/docs/commands-reference/import.md | 16 +++++----- static/docs/commands-reference/index.md | 26 ++++++++-------- static/docs/commands-reference/init.md | 4 +-- static/docs/commands-reference/install.md | 30 +++++++++---------- .../docs/commands-reference/metrics/index.md | 8 ++--- .../docs/commands-reference/metrics/show.md | 6 ++-- 8 files changed, 59 insertions(+), 62 deletions(-) diff --git a/static/docs/commands-reference/get.md b/static/docs/commands-reference/get.md index 89562c246c..8658bb98a5 100644 --- a/static/docs/commands-reference/get.md +++ b/static/docs/commands-reference/get.md @@ -1,7 +1,7 @@ # get -Download or copy file or directory from another DVC repository (on a git server -such as Github) into the local file system. +Download or copy file or directory from another DVC repository (on a Git server +e.g. Github) into the local file system. 
> Unlike `dvc import`, this command does not track the downloaded data files > (does not create a DVC-file). @@ -23,9 +23,10 @@ other files and directories tracked in another DVC repository into the current working directory, regardless of whether it's a DVC project. The `dvc get` command downloads such a data artifact. -The `url` argument specifies the external DVC project's Git repository URL (both -HTTP and SSH protocols supported, e.g. `[user@]server:project.git`), while -`path` is used to specify the path to the data to be downloaded within the repo. +The `url` argument specifies the address of the Git repository containing the +external DVC project (both HTTP and SSH protocols supported, e.g. +`[user@]server:project.git`). `path` is used to specify the path of the data to +be downloaded within the repo. Note that this command doesn't require an existing DVC project to run in. It's a single-purpose command that can be used out of the box after installing DVC. @@ -80,12 +81,12 @@ is found, which specifies `model.pkl` in its outputs (`outs`). DVC then its [config file](https://github.com/iterative/example-get-started/blob/master/.dvc/config)). -A common use for downloading binary files from DVC repos, as done in this -example, is to place a ML model inside a wrapper application that serves as an -[ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) pipeline or as an -HTTP/RESTful API (web service) that provides predictions upon request. This can -be automated leveraging DVC with [CI/CD](https://en.wikipedia.org/wiki/CI/CD) -tools. +A recommended use for downloading binary files from DVC repositories, as done in +this example, is to place a ML model inside a wrapper application that serves as +an [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) pipeline or as +an HTTP/RESTful API (web service) that provides predictions upon request. This +can be automated leveraging DVC with +[CI/CD](https://en.wikipedia.org/wiki/CI/CD) tools. 
The same example applies to raw or intermediate data files as well, of course, for cases where we want to download those files and perform some analysis on diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md index 77f5ec18ad..a1ecd651f3 100644 --- a/static/docs/commands-reference/import-url.md +++ b/static/docs/commands-reference/import-url.md @@ -41,8 +41,8 @@ DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such a DVC-file, the `deps` section stores the remote URL, and the `outs` section -contains the corresponding local path in the workspace. It records enough data -from the external file or directory to enable DVC to efficiently check it to +contains the corresponding local path in the workspace. It records metadata from +the external file or directory, allowing DVC to efficiently check it later and determine whether the local copy is out of date. DVC supports several types of (local or) remote locations (protocols): @@ -184,8 +184,8 @@ outs: The `etag` field in the DVC-file contains the [ETag](https://en.wikipedia.org/wiki/HTTP_ETag) recorded from the HTTP request. -If the remote file changes, its ETag will be different, letting DVC know whether -its necessary to download it again. +If the remote file changes, its ETag will be different. This metadata allows DVC +to determine whether its necessary to download it again. > See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the > text format above. diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index 965c371b52..2b0ad821e9 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -28,10 +28,10 @@ workspace. 
The `dvc import` command downloads such a data artifact in a way that it is tracked with DVC, so it can be updated when the external data source changes. -The `url` argument specifies the Git repository URL of the external DVC -project (both HTTP and SSH protocols are supported, e.g. -`[user@]server:project.git`), while `path` is used to specify the path to the -data to be downloaded within the repo. +The `url` argument specifies the address of the Git repository containing the +external DVC project (both HTTP and SSH protocols supported, e.g. +`[user@]server:project.git`). `path` is used to specify the path of the data to +be downloaded within the repo. > See `dvc import-url` to download and tack data from other supported URLs. @@ -53,7 +53,7 @@ To actually [track the data](https://dvc.org/doc/get-started/add-files), Note that import stages are considered always "locked", meaning that if you run `dvc repro`, they won't be updated. Use `dvc update` on them to update the -downloaded data artifact from the external DVC repo. +downloaded data artifact from the external DVC repository. ## Options @@ -74,8 +74,8 @@ downloaded data artifact from the external DVC repo. ## Examples -An obvious case for this command is to import a dataset from an external DVC -repo, such as our +A simple case for this command is to import a dataset from an external DVC repo, +such as our [get started example repo](https://github.com/iterative/example-get-started). ```dvc @@ -111,5 +111,3 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used to specify the origin and version of the dependency. - - diff --git a/static/docs/commands-reference/index.md b/static/docs/commands-reference/index.md index 3ae493b0cb..2c3ea0afef 100644 --- a/static/docs/commands-reference/index.md +++ b/static/docs/commands-reference/index.md @@ -2,16 +2,16 @@ DVC is a command-line tool. 
The typical use case for DVC goes as follows: -- In an existing Git repository, initialize a DVC repository with `dvc init`. -- Copy source code files for modeling into the repository and convert the files - into DVC data files with `dvc add` command. -- Process raw data files through your data processing and modeling code using - the `dvc run` command. -- Use `--outs` option to specify `dvc run` command outputs which will be - converted to DVC data files after the code runs. -- Clone a git repo with the code of your ML application pipeline. However, this - will not copy your DVC cache. Use - [data remotes](/doc/commands-reference/remote) and `dvc push` to share the - cache (data). -- Use `dvc repro` to quickly reproduce your pipeline on a new iteration, after - your data item files or source code of your ML application are modified. +- In an existing Git repository, initialize a DVC project with + `dvc init`. +- Copy source code files for modeling into the repository and track the files + with DVC using the `dvc add` command. +- Process raw data with your own data processing and modeling code using the + `dvc run` command, using the `--outs` option to outputs which will also be + tracked by DVC after the code is executed. +- Sharing a Git repository with the source code of your ML + [pipeline](/doc/commands-reference/pipeline) will not include the project's + cache. Use [remote storage](/doc/commands-reference/remote) and + `dvc push` to share this cache (data tracked by DVC). +- Use `dvc repro` to automatically reproduce your full pipeline, iteratively as + input data or source code change. diff --git a/static/docs/commands-reference/init.md b/static/docs/commands-reference/init.md index 147ba53cac..bbbf5e6a17 100644 --- a/static/docs/commands-reference/init.md +++ b/static/docs/commands-reference/init.md @@ -1,6 +1,6 @@ # init -This command initializes a DVC project on a directory. +This command initializes a DVC project on a directory. 
Note that by default the current working directory is expected to contain a Git repository, unless the `--no-scm` option is used. @@ -42,7 +42,7 @@ is a local cache and you cannot `git push` it. ## Examples -Creating a new DVC repository (requires a Git repository). +Create a new DVC repository (requires Git): ```dvc $ mkdir example && cd example diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md index dbc0521873..b64fa9e434 100644 --- a/static/docs/commands-reference/install.md +++ b/static/docs/commands-reference/install.md @@ -1,6 +1,6 @@ # install -Install DVC hooks into the Git repository to automate certain common actions. +Install Git hooks into the DVC repository to automate certain common actions. ## Synopsis @@ -17,30 +17,28 @@ automatically. Namely: -**Checkout**: For any given branch or tag, Git checks out the +**Checkout**: For any given branch or tag, `git checkout` retrieves the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The -DVC-files in turn refer to data files in the DVC cache by checksum. -When switching from one SCM branch or tag to another, the SCM retrieves the -corresponding DVC-files. By default that leaves the project in a -state where the DVC-files refer to data files other than what is currently in -the workspace. The user at this point should run `dvc checkout` so -that the data files will match the current DVC-files. +project's DVC-files in turn refer to data stored in +cache, but not necessarily in the workspace. Normally, +it would be necessary to run `dvc checkout` to synchronize workspace and +DVC-files. The installed Git hook automates running `dvc checkout`. **Commit**: When committing a change to the Git repository, that change possibly requires reproducing the corresponding -[pipeline](/doc/commands-reference/pipeline) (with `dvc repro`) to regenerate -the project results. 
Or there might be files not yet in the cache, which is a -reminder to run `dvc commit`. +[pipeline](/doc/commands-reference/pipeline) (using `dvc repro`) to regenerate +the project results. Or there might be new data not yet in cache, which requires +running `dvc commit` to update. The installed Git hook automates reminding the user to run either `dvc repro` or -`dvc commit`. +`dvc commit`, as needed. **Push**: While publishing changes to the Git remote repository with `git push`, -it easy to forget that `dvc push` command usually needs to be run to save -corresponding changes in data files and directories that are under DVC control -to the DVC remote storage. +it easy to forget that the `dvc push` command is necessary to upload new or +updated data files and directories under DVC control to +[remote storage](/doc/commands-reference/remote). The installed Git hook automates executing `dvc push`. @@ -51,7 +49,7 @@ The installed Git hook automates executing `dvc push`. - A `post-checkout` hook executes `dvc checkout` after `git checkout` to automatically synchronize the data files with the new workspace state. - A `pre-push` hook executes `dvc push` before `git push` to upload files and - directories under DVC control to remote. + directories under DVC control to remote storage. For more information about git hooks, refer to the [git-scm documentation](https://git-scm.com/docs/githooks). diff --git a/static/docs/commands-reference/metrics/index.md b/static/docs/commands-reference/metrics/index.md index cbde21151d..a9327bdec1 100644 --- a/static/docs/commands-reference/metrics/index.md +++ b/static/docs/commands-reference/metrics/index.md @@ -31,7 +31,7 @@ way to compare and pick the best performing experiment variant. [show](/doc/commands-reference/metrics/show), [modify](/doc/commands-reference/metrics/modify), and [remove](/doc/commands-reference/metrics/remove) commands are available to set -up and manage DVC metrics. +up and manage DVC project metrics. 
## Options @@ -56,7 +56,7 @@ $ dvc run -d code/evaluate.py -M data/eval.json \ > running `dvc metrics add data/eval.json` to explicitly mark `data/eval.json` > as a metric file. -Now let's print metric values that we are tracking in this DVC project: +Now let's print metric values that we are tracking in this project: ```dvc $ dvc metrics show -a @@ -65,8 +65,8 @@ $ dvc metrics show -a data/eval.json: {"AUC": "0.624652"} ``` -Then we can tell DVC an `xpath` for the metric file, so that it can output only -the value of AUC. In the case of JSON, it uses +We can also tell DVC an `xpath` for the metric file, so that it can output only +the value of AUC. In the case of JSON, use [JSONPath expressions](https://goessner.net/articles/JsonPath/index.html) to selectively extract data out of metric files: diff --git a/static/docs/commands-reference/metrics/show.md b/static/docs/commands-reference/metrics/show.md index 824b005e14..3f3ff4f9d0 100644 --- a/static/docs/commands-reference/metrics/show.md +++ b/static/docs/commands-reference/metrics/show.md @@ -19,9 +19,9 @@ It will find and print all metric files (default) or a specified metric file in the current branch (if `targets` are provided) or across all branches/tags (if `-a` or`-T` specified respectively). -The optional `targets` argument represents several DVC metric files or -directories. If a `target` is a directory, recursively search and process all -metric files in it with the `-R` option. +The optional `targets` argument represents several metric files or directories. +If a `target` is a directory, recursively search and process all metric files in +it with the `-R` option. 
Providing `type` (via `-t` CLI option), overrides the full metric specification (both, `type` and `xpath`) defined in the DVC-file (usually, using From f9ab91acd772cf68d5683588af1d62a6461dd53c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 5 Sep 2019 12:53:26 -0500 Subject: [PATCH 25/26] term: review "runs" throughout --- static/docs/changelog/0.35.md | 4 ++-- static/docs/commands-reference/checkout.md | 2 +- static/docs/commands-reference/import.md | 2 ++ static/docs/commands-reference/repro.md | 10 +++++----- static/docs/commands-reference/status.md | 13 +++++++------ static/docs/get-started/example-versioning.md | 12 ++++++------ static/docs/tutorial/define-ml-pipeline.md | 4 ++-- .../docs/user-guide/contributing-documentation.md | 10 ++++++---- 8 files changed, 31 insertions(+), 26 deletions(-) diff --git a/static/docs/changelog/0.35.md b/static/docs/changelog/0.35.md index 202c96fdd1..21915242bb 100644 --- a/static/docs/changelog/0.35.md +++ b/static/docs/changelog/0.35.md @@ -41,8 +41,8 @@ improvements) we have done in the last few months: - We’ve introduced the DVC commit command and `dvc run/repro/add --no-commit` flag to give a way to **avoid uncontrolled cache growth** and as a way to save - some `dvc repro` runs. In the future we plan to have “do-not-cache-my-data” as - a default mode for `dvc run`, `dvc add` and `dvc repro`. + some runs of `dvc repro`. In the future we plan to have “do-not-cache-my-data” + as a default mode for `dvc run`, `dvc add` and `dvc repro`. - **SSH remotes (data storage) support** - config options to set port, key files, timeouts, password, etc + improved stability and Windows support! 
diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md index 33e3f275ab..9224594934 100644 --- a/static/docs/commands-reference/checkout.md +++ b/static/docs/commands-reference/checkout.md @@ -205,7 +205,7 @@ MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43 ``` What happened is that DVC went through the sole existing DVC-file and adjusted -the current set of files to match the `outs` of that stage. `dvc fetch` runs +the current set of files to match the `outs` of that stage. `dvc fetch` is run once to download missing data from the remote storage to the cache. Alternatively, we could have just run `dvc pull` in this case to automatically do `dvc fetch` + `dvc checkout`. diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md index 2b0ad821e9..ce3f1fd0a7 100644 --- a/static/docs/commands-reference/import.md +++ b/static/docs/commands-reference/import.md @@ -111,3 +111,5 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repo. `url` and `rev_lock` fields are used to specify the origin and version of the dependency. + + diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md index 554ded7502..e6fa96615a 100644 --- a/static/docs/commands-reference/repro.md +++ b/static/docs/commands-reference/repro.md @@ -116,10 +116,10 @@ specified), and updates stage files with the new checksum information. ## Examples -For simplicity, let's build a pipeline defined below (if you want get your hands -on something more real, see this -[mini-tutorial](/doc/get-started/example-pipeline)). It takes this `text.txt` -file: +For simplicity, let's build a pipeline defined below. (If you want get your +hands on something more real, see this shot +[pipeline tutorial](/doc/get-started/example-pipeline)). 
It takes this +`text.txt` file: ``` dvc @@ -166,7 +166,7 @@ $ tree ├── count.txt <---- result: "2" ├── filter.dvc <---- first stage ├── numbers.txt <---- intermediate result of the first stage -├── process.py <---- code that runs some transformation +├── process.py <---- code that causes data transformation └── text.txt <---- text file to process ``` diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md index c715cced35..2e62d309bd 100644 --- a/static/docs/commands-reference/status.md +++ b/static/docs/commands-reference/status.md @@ -113,14 +113,15 @@ workspace) is different from remote storage. Bringing the two into sync requires `dvc remote list`) to compare against. The argument, `REMOTE`, is a remote name defined using the `dvc remote` command. Implies `--cloud`. -- `-a`, `--all-branches` - compares cache content against all Git branches. - Instead of checking just the current workspace version, it runs the same - status command in all the branches of this repo. The corresponding branches - are shown in the status output. Applies only if `--cloud` or a `-r` remote is - specified. +- `-a`, `--all-branches` - compares cache content against all Git branches + instead of checking just the current workspace version. This basically runs + the same status command in all the branches of this repo. The corresponding + branches are shown in the status output. Applies only if `--cloud` or a `-r` + remote is specified. - `-T`, `--all-tags` - compares cache content against all Git tags instead of - checking just the current workspace version. The corresponding tags are shown + checking just the current workspace version. This basically runs the same + status command in all the tags of this repo. The corresponding tags are shown in the status output. Applies only if `--cloud` or a `-r` remote is specified. 
 - `-j JOBS`, `--jobs JOBS` - specifies the number of jobs DVC can use to
diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md
index 27023998f3..9f38ba2c60 100644
--- a/static/docs/get-started/example-versioning.md
+++ b/static/docs/get-started/example-versioning.md
@@ -43,8 +43,8 @@ $ git clone https://github.com/iterative/example-versioning.git
 $ cd example-versioning
 ```
 
-This command pulls a repository with a single script `train.py` that runs the
-training.
+This command pulls a repository with a single script `train.py` that will train
+the model.
 
 Now let's install the requirements. But before we do that, we **strongly**
 recommend creating a virtual environment with a tool such as
@@ -326,10 +326,10 @@ commands. Here we would like to outline some next topics and ideas you would be
 interested to try to learn more about DVC and how it makes managing ML projects
 simpler.
 
-First of all, you should have probably noticed that the script that trains a
-model is written in a monolithic way. It runs the `save_bottleneck_feature`
-function to pre-calculate bottom, "frozen" part of the net every time it is run.
-Features are written into files, and intention probably was that the
+First of all, you may have noticed that the script that trains the model is
+written in a monolithic way. It uses the `save_bottleneck_feature` function to
+pre-calculate bottom, "frozen" part of the net every time it is run. Features
+are written into files, and intention probably was that the
 `save_bottleneck_feature` can be commented out after the first run. It's not
 very convenient to remember to comment/uncomment it every time dataset is
 changed.
diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md
index de7ba2fcaf..822e150c08 100644
--- a/static/docs/tutorial/define-ml-pipeline.md
+++ b/static/docs/tutorial/define-ml-pipeline.md
@@ -188,8 +188,8 @@ command. `-d data/Posts.xml.zip` defines the input file and `-o data/Posts.xml`
 the resulting extracted data file.

 The `unzip` command extracts data file `data/Posts.xml.zip` to a regular file
-`data/Posts.xml`. It knows nothing about data files or DVC. DVC runs the command
-and does some additional work if the command was successful:
+`data/Posts.xml`. It knows nothing about data files or DVC. DVC executes the
+command and does some additional work if the command was successful:

 1. DVC transforms all the outputs `-o` files into data files. It is like
    applying `dvc add` for each of the outputs. As a result, all the actual data
diff --git a/static/docs/user-guide/contributing-documentation.md b/static/docs/user-guide/contributing-documentation.md
index 0466d99f16..207836a574 100644
--- a/static/docs/user-guide/contributing-documentation.md
+++ b/static/docs/user-guide/contributing-documentation.md
@@ -64,9 +64,10 @@ $ git clone git@github.com:/dvc.org.git
 ```
 
 Make sure you have the latest version of [Node.js](https://nodejs.org/en/) and
-[yarn](https://yarnpkg.com/en/) installed. Install and keep the dependencies up
-to date by running `yarn` often. This will also enable the Git pre-commit hook
-that will be formatting your code and documentation files automatically.
+[yarn](https://yarnpkg.com/en/) installed. Install the dependencies by running
+`yarn`. (Run it continuously as the repository changes to keep the dependencies
+up to date.) This will also enable the Git pre-commit hook that will be
+formatting your code and documentation files automatically.

 It's highly recommended to run the Node docs app locally to check documentation
 changes before submitting them, and its very much needed in order to make
@@ -88,7 +89,8 @@ command before committing them.
   Visual Studio Code and the
   [Rewrap](https://marketplace.visualstudio.com/items?itemName=stkb.rewrap)
   plugin. Correct formatting will be done automatically by a Git pre-commit hook
-  which is integrated when `yarn` runs in the instructions above.
+  which is integrated when `yarn` installs the project dependencies (explained
+  in the instructions above).

 - We use [Prettier](https://prettier.io/) default conventions to format our
   source code files. The formatting of staged files will automatically be done

From 03eaa8b188c73746aacef2a5b164406585811f48 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 5 Sep 2019 13:09:01 -0500
Subject: [PATCH 26/26] term: review usage of "data remote" and include "remote storage" more

---
 static/docs/commands-reference/remote/add.md  | 27 ++++++++++---------
 .../docs/commands-reference/remote/default.md |  4 +--
 .../docs/commands-reference/remote/index.md   | 16 +++++------
 static/docs/commands-reference/remote/list.md |  6 ++---
 .../docs/commands-reference/remote/modify.md  |  8 +++---
 .../docs/commands-reference/remote/remove.md  |  2 +-
 .../understanding-dvc/related-technologies.md |  9 ++++---
 7 files changed, 37 insertions(+), 35 deletions(-)

diff --git a/static/docs/commands-reference/remote/add.md b/static/docs/commands-reference/remote/add.md
index 3e2ae3b22e..2e7a82c334 100644
--- a/static/docs/commands-reference/remote/add.md
+++ b/static/docs/commands-reference/remote/add.md
@@ -24,18 +24,19 @@ positional arguments:
 ## Description

 `name` and `url` are required. `url` specifies a location to store your data. It
-could be S3 path, SSH path, Azure, Google cloud, Aliyun OSS local directory,
-etc. (See more examples below.) If `url` is a local relative path, it will be
-resolved relative to the current working directory but saved **relative to the
-config file location** (see LOCAL example below). Whenever possible DVC will
-create a remote directory if it doesn't exists yet. It won't create an S3 bucket
-though and will rely on default access settings.
-
-> If you installed DVC via `pip`, depending on the remote type you plan to use
-> you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`,
-> `[azure]`, and `[oss]`; or `[all]` to include them all. The command should
-> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along
-> with DVC to support AWS S3 storage.
+can be an SSH, S3 path, Azure, Google Cloud address, Aliyun OSS local directory,
+etc. (See all the supported remote storage types in the examples below.) If
+`url` is a local relative path, it will be resolved relative to the current
+working directory but saved **relative to the config file location** (see LOCAL
+example below). Whenever possible DVC will create a remote directory if it
+doesn't exists yet. It won't create an S3 bucket though and will rely on default
+access settings.
+
+> If you installed DVC via `pip`, depending on the remote storage type you plan
+> to use you might need to install optional dependencies: `[s3]`, `[ssh]`,
+> `[gs]`, `[azure]`, and `[oss]`; or `[all]` to include them all. The command
+> should look like this: `pip install "dvc[s3]"`. This installs `boto3` library
+> along with DVC to support AWS S3 storage.

 This command creates a section in the DVC project's
 [config file](/doc/commands-reference/config) and optionally assigns a default
@@ -78,7 +79,7 @@ Use `dvc config` to unset/change the default remote as so:

 ## Examples

-The following are the types and of remotes (protocols) supported:
+The following are the types of remote storage (protocols) supported:
diff --git a/static/docs/commands-reference/remote/default.md b/static/docs/commands-reference/remote/default.md
index b9aa8ef804..5bd3069e27 100644
--- a/static/docs/commands-reference/remote/default.md
+++ b/static/docs/commands-reference/remote/default.md
@@ -2,8 +2,8 @@

 Set/unset a default data remote.

-> Depending on your storage type, you may also need `dvc remote modify` to
-> provide credentials and/or configure other remote parameters.
+> Depending on your remote storage type, you may also need `dvc remote modify`
+> to provide credentials and/or configure other remote parameters.

 See also [add](/doc/commands-reference/remote/add),
 [list](/doc/commands-reference/remote/list),
diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md
index 8b8348fef8..368168946b 100644
--- a/static/docs/commands-reference/remote/index.md
+++ b/static/docs/commands-reference/remote/index.md
@@ -25,19 +25,19 @@ positional arguments:

 What is data remote?

-The same way as Github provides storage hosting for Git repositories, DVC data
-remotes provide a central place to keep and share data and model files. With a
-remote data storage, you can pull models and data files which were created by
+The same way as Github provides storage hosting for Git repositories, DVC
+remotes provide a central place to keep and share data and model files. With
+this remote storage, you can pull models and data files which were created by
 your team members without spending time and resources to build or process them
 locally. It also saves space on your local environment – DVC can
 [fetch](/doc/commands-reference/fetch) into the cache directory only the data
 you need for a specific branch/commit.

-> If you installed DVC via `pip`, depending on the remote type you plan to use
-> you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`,
-> `[azure]`, and `[oss]`; or `[all]` to include them all. The command should
-> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along
-> with DVC to support AWS S3 storage.
+> If you installed DVC via `pip`, depending on the remote storage type you plan
+> to use you might need to install optional dependencies: `[s3]`, `[ssh]`,
+> `[gs]`, `[azure]`, and `[oss]`; or `[all]` to include them all. The command
+> should look like this: `pip install "dvc[s3]"`. This installs `boto3` library
+> along with DVC to support AWS S3 storage.

 Using DVC with a remote data storage is optional. By default, DVC is configured
 to use a local data storage only (usually `.dvc/cache` directory inside your
diff --git a/static/docs/commands-reference/remote/list.md b/static/docs/commands-reference/remote/list.md
index 9ce0b2d19e..02b8058ae5 100644
--- a/static/docs/commands-reference/remote/list.md
+++ b/static/docs/commands-reference/remote/list.md
@@ -1,6 +1,6 @@
 # remote list

-Show all available remotes.
+Show all available data remotes.

 See also [add](/doc/commands-reference/remote/add),
 [default](/doc/commands-reference/remote/default),
@@ -15,8 +15,8 @@ usage: dvc remote list [-h] [--global] [--system] [--local] [-q | -v]

 ## Description

-Reads DVC configuration files and prints the list of available remotes.
-Including names and URLs.
+Reads DVC configuration files and prints the list of available remotes,
+including names and URLs.

 ## Options

diff --git a/static/docs/commands-reference/remote/modify.md b/static/docs/commands-reference/remote/modify.md
index eaeaf67b3a..76eb907ff9 100644
--- a/static/docs/commands-reference/remote/modify.md
+++ b/static/docs/commands-reference/remote/modify.md
@@ -1,10 +1,10 @@
 # remote modify

-Modify configuration of remotes.
+Modify configuration of data remotes.

 > This command is commonly needed after `dvc remote add` or
 > [default](/doc/commands-reference/remote/default) to setup credentials or
-> other customizations to each remote type.
+> other customizations to each remote storage type.

 See also [add](/doc/commands-reference/remote/add),
 [default](/doc/commands-reference/remote/default),
@@ -27,7 +27,7 @@ positional arguments:
 ## Description

 Remote `name` and `option` name are required. Option names are remote type
-specific. See below examples and a list of per remote type: AWS S3, Google
+specific. See below examples and a list of remote storage types: AWS S3, Google
 Cloud, Azure, SSH, ALiyun OSS, and others.

 This command modifies a `remote` section in the project's
@@ -60,7 +60,7 @@ manual editing could be used to change the configuration.

 ## Examples

-The following are the types and of remotes (protocols) supported:
+The following are the types of remote storage (protocols) supported:
diff --git a/static/docs/commands-reference/remote/remove.md b/static/docs/commands-reference/remote/remove.md
index 78628f0a53..e21ee00669 100644
--- a/static/docs/commands-reference/remote/remove.md
+++ b/static/docs/commands-reference/remote/remove.md
@@ -1,6 +1,6 @@
 # remote remove

-Remove a specified remote. This command affects DVC configuration files only, it
+Remove a data remotes. This command affects DVC configuration files only, it
 does not physically remove data files stored remotely.

 See also [add](/doc/commands-reference/remote/add),
diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md
index 75d268796b..01bfe42d8e 100644
--- a/static/docs/understanding-dvc/related-technologies.md
+++ b/static/docs/understanding-dvc/related-technologies.md
@@ -94,11 +94,12 @@ process.

 - Git-annex is a datafile-centric system whereas DVC is focused on providing a
   workflow for machine learning and reproducible experiments. When a DVC or
-  Git-annex repository is cloned via git clone, data files won't be copied to
-  the local machine as file content is stored in separate data remotes. However,
+  Git-annex repository is cloned via `git clone`, data files won't be copied to
+  the local machine as file contents are stored in separate
+  [remotes](/doc/commands-reference/remote). With DVC,
   [DVC-files](/doc/user-guide/dvc-file-format) (which provide the reproducible
-  workflow) are always included in the cloned Git repository and hence can be
-  recreated locally with minimal effort.
+  workflow) are always included in the Git repository and hence can be recreated
+  locally with minimal effort.

 - DVC is not fundamentally bound to Git, having the option of changing the
   repository format.