diff --git a/src/Documentation/glossary.js b/src/Documentation/glossary.js
index d6077d3037..7a4f946619 100644
--- a/src/Documentation/glossary.js
+++ b/src/Documentation/glossary.js
@@ -12,6 +12,9 @@ Directory containing all your project files. For example raw datasets, source
code, ML models, etc. A workspace becomes a **DVC project** when
[\`dvc init\`](/doc/commands-reference/init) is run, and
[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it.
+
+Note that [external outputs](/doc/user-guide/external-outputs) also form part
+of your expanded workspace, technically.
`
},
{
@@ -26,7 +29,7 @@ Initialized by running \`dvc init\` in the **workspace**. It will contain the
},
{
name: 'DVC Cache',
- match: ['DVC cache', 'cache', 'cache directory', 'data cache', 'cached'],
+ match: ['DVC cache', 'cache', 'cached'],
desc: `
The DVC cache is a hidden storage (by default located in the \`.dvc/cache\`
directory) for files that are under DVC control, and their different versions.
diff --git a/static/docs/changelog/0.18.md b/static/docs/changelog/0.18.md
index c4eb5e545a..d18c01746a 100644
--- a/static/docs/changelog/0.18.md
+++ b/static/docs/changelog/0.18.md
@@ -17,7 +17,7 @@ really excited to share the progress with you:
- ⚡ **DVC just got faster**
- Data files management commands like `dvc add`, `dvc push`, `dvc pull`, etc.
- got up to 10x faster on data sets with large number of files.
+ got up to 10x faster on datasets with a large number of files.
- Command startup latency reduced 3x
diff --git a/static/docs/changelog/0.35.md b/static/docs/changelog/0.35.md
index c3b1910151..21915242bb 100644
--- a/static/docs/changelog/0.35.md
+++ b/static/docs/changelog/0.35.md
@@ -14,7 +14,7 @@ improvements) we have done in the last few months:
- The [Get Started](/doc/get-started/agenda) section has been simplified
(e.g. to use tags instead of branches) and extended. We have also prepared a
- [Github DVC project ](https://github.com/iterative/example-get-started) that
+ [DVC project on GitHub](https://github.com/iterative/example-get-started) that
reflects the sequence of chapters in the “get started” guide. You can now
download the whole project and reproduce all the models.
@@ -41,8 +41,8 @@ improvements) we have done in the last few months:
- We’ve introduced the `dvc commit` command and `dvc run/repro/add --no-commit`
flag to give a way to **avoid uncontrolled cache growth** and as a way to save
- some `dvc repro` runs. In the future we plan to have “do-not-cache-my-data” as
- a default mode for `dvc run`, `dvc add` and `dvc repro`.
+ some runs of `dvc repro`. In the future we plan to have “do-not-cache-my-data”
+ as a default mode for `dvc run`, `dvc add` and `dvc repro`.
- **SSH remotes (data storage) support** - config options to set port, key
files, timeouts, password, etc + improved stability and Windows support!
@@ -63,7 +63,7 @@ improvements) we have done in the last few months:
general user experience for the commands that navigate tags or branches (all
the commands that include `--all-branches`, `-a` or `--all-tags`, `-T`).
-There are new [DVC integrations and plugins](/doc/user-guide/plugins) available:
+There are new [integrations and plugins](/doc/user-guide/plugins) available:
- Finally there is an official
[Bash and Zsh completion](/doc/user-guide/autocomplete) for DVC!
diff --git a/static/docs/commands-reference/add.md b/static/docs/commands-reference/add.md
index bea3c4425a..cfb44d799a 100644
--- a/static/docs/commands-reference/add.md
+++ b/static/docs/commands-reference/add.md
@@ -1,7 +1,7 @@
# add
Take a data file or a directory under DVC control (by creating a corresponding
-DVC-file).
+[DVC-file](/doc/user-guide/dvc-file-format)).
## Synopsis
@@ -15,37 +15,43 @@ positional arguments:
## Description
-The `dvc add` command is analogous to the `git add` command. By default an added
-file is committed to the DVC cache. Using the `--no-commit` option, the file
-will not be added to the cache and instead the `dvc commit` command is used when
-(or if) the file is to be committed to the DVC cache.
+The `dvc add` command is analogous to the `git add` command. By default though,
+an added file or directory is also committed to the cache. (Use the
+`--no-commit` option to avoid this, and `dvc commit` as a separate step when
+ready.)
-Under the hood, a few actions are taken for each file in `targets`:
+The `targets` are files or directories to be placed under DVC control. These are
+turned into outputs (`outs` field) in a resulting
+[DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.)
+Note that target data outside the current workspace is also
+supported; such targets become [external outputs](/doc/user-guide/external-outputs).
+
+Under the hood, a few actions are taken for each file (or directory) in
+`targets`:
1. Calculate the file checksum.
-2. Move the file content to the DVC cache (default location is `.dvc/cache`).
-3. Replace the file by a link to the file in the cache (see details below).
-4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store
- the MD5 checksum to identify the cache entry.
-5. Add the targets to `.gitignore` (if Git is used in this
- workspace) to prevent it from being committed to the Git
+2. Move the file contents to the cache directory (by default in `.dvc/cache`),
+ using the checksum to form the cached file name.
+3. Replace the file by a link to the file in cache (see details below).
+4. Create a corresponding DVC-file and store the checksum to identify the cached
+ file. Unless the `-f` option is used, the DVC-file name generated by default
+ is `<file>.dvc`, where `<file>` is the file name of the first target.
+5. Unless `dvc init --no-scm` was used when initializing the project, add the
+ `targets` to `.gitignore` in order to prevent them from being committed to
+ the Git repository.
+6. Unless `dvc init --no-scm` was used when initializing the project,
+ instructions are printed showing `git` commands for adding the files to a Git
repository.
-6. Instructions are printed showing `git` commands for adding the files to a Git
- repository. If a different SCM system is being used, use the equivalent
- command for that system or nothing is printed if `--no-scm` was specified for
- the repository.
-
-Unless the `-f` options is used, by default the DVC-file name generated is
-`<file>.dvc`, where `<file>` is file name of the first output (from `targets`).
-The result is data file is added to the DVC cache, and DVC-files can be tracked
-via Git or other version control system. The DVC-file lists the added file as an
-output (`out`), and references the DVC cache entry using the checksum. See
+The result is that the target data gets cached by DVC, and instead small
+DVC-files can be tracked with Git. The DVC-file lists the added file as an
+output (`outs` field), and references the cached file using the checksum. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.
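
As a rough illustration of steps 1 and 2 above, the following Python sketch
computes an MD5 checksum and stores a file in a cache-like directory whose
layout is derived from that checksum. The helper names (`file_md5`,
`save_to_cache`) are hypothetical, and copying is an assumption made for
simplicity: this is not DVC's actual implementation, which moves the file and
then links it back into the workspace.

```python
import hashlib
import os
import shutil


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 checksum of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def save_to_cache(path, cache_dir):
    """Copy `path` into `cache_dir` under a name formed from its checksum.

    Hypothetical helper for illustration only. Returns (checksum, cached_path).
    """
    checksum = file_md5(path)
    # First two characters name the subdirectory, the rest name the file.
    cached_path = os.path.join(cache_dir, checksum[:2], checksum[2:])
    os.makedirs(os.path.dirname(cached_path), exist_ok=True)
    if not os.path.exists(cached_path):
        shutil.copyfile(path, cached_path)
    return checksum, cached_path
```

Because the cached file name depends only on the content, adding two identical
files in this sketch results in a single cached copy.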
-> Note that DVC-files created by this command are _orphans_: they have no
-> dependencies. _Orphan_ "stage files" are always considered _changed_ by
-> `dvc repro`, which always executes them.
+> Note that DVC-files created by this command are considered _orphans_ because
+> they have no dependencies, only outputs. These _orphan_ "stage files" are
+> always treated as _changed_ by `dvc repro`, which always executes them. See
+> `dvc run` to learn about regular stage files.
By default DVC tries to use reflinks (see
[File link types](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
@@ -59,14 +65,14 @@ to work with directory hierarchies with `dvc add`.
1. With `dvc add --recursive`, the hierarchy is traversed and every file is
added individually as described above. This means every file has its own
- DVC-file, and a corresponding DVC cache entry is made (unless `--no-commit`
- flag is added).
+ DVC-file, and a corresponding cached file is created (unless the
+ `--no-commit` flag is used).
2. When not using `--recursive` a DVC-file is created for the top of the
directory (with default name `dirname.dvc`). Every file in the hierarchy is
- added to the DVC cache (unless `--no-commit` flag is added), but DVC does not
+ added to the cache (unless the `--no-commit` flag is used), but DVC does not
produce individual DVC-files for each file in the directory tree. Instead,
- the single DVC-file points to a file in the DVC cache that contains
- references to the files in the added hierarchy.
+ the single DVC-file points to a file in the cache that contains references to
+ the files in the added hierarchy.
In a DVC project, `dvc add` can be used to version control any
data artifact (input, intermediate, or output files and
@@ -84,10 +90,10 @@ and make your project reproducible.
found, a new DVC-file is created using the process described in this command's
description.
-- `--no-commit` - do not put files/directories into cache. A DVC-file is
- created, and an entry is added to `.dvc/state`, while nothing is added to the
- cache. Use `dvc commit` when you are ready to save your results to cache. This
- is analogous to using `git add` before `git commit`.
+- `--no-commit` - do not save outputs to cache. A DVC-file is created, and an
+ entry is added to `.dvc/state`, while nothing is added to the cache. This is
+ analogous to using `git add` before `git commit`. Use `dvc commit` when ready
+ to commit the results to cache.
> The `dvc status` command will mention that the file is `not in cache`.
@@ -194,9 +200,9 @@ Saving information to 'pics.dvc'.
```
There are no DVC-files generated within this directory structure, but the images
-are all added to the DVC cache. DVC prints a message to that effect, saying that
-`md5` values are computed for each directory. A DVC-file is generated for the
-top-level directory, and it contains this:
+are all added to the cache. DVC prints a message to that effect,
+saying that `md5` values are computed for each directory. A DVC-file is
+generated for the top-level directory, and it contains this:
```yaml
md5: df06d8d51e6483ed5a74d3979f8fe42e
@@ -225,7 +231,7 @@ top-level DVC-file is generated. But this is less convenient.
With the `dvc add pics` a single DVC-file is generated, `pics.dvc`, which lets
us treat the entire directory structure in one unit. It lets you pass the whole
-directory tree as a dependency to a `dvc run` stage like so:
+directory tree as a dependency to a `dvc run` stage definition, like this:
```dvc
$ dvc run -f train.dvc \
diff --git a/static/docs/commands-reference/cache/dir.md b/static/docs/commands-reference/cache/dir.md
index 846ff2c390..bdb5ecc265 100644
--- a/static/docs/commands-reference/cache/dir.md
+++ b/static/docs/commands-reference/cache/dir.md
@@ -1,6 +1,6 @@
# cache dir
-Set/unset the cache directory location intuitively (compared to
+Set/unset the cache directory location intuitively (compared to
using `dvc config cache`).
## Synopsis
@@ -16,10 +16,12 @@ positional arguments:
## Description
-Helper to set the `cache.dir` configuration option. Unlike doing so with
-`dvc config cache`, this command transform paths (`value`) that are provided
-relative to the current working directory into paths **relative to the config
-file location**. They are required in the latter form for the config file.
+Helper to set the `cache.dir` configuration option. (See
+[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).)
+Unlike doing so with `dvc config cache`, this command transforms paths (`value`)
+that are provided relative to the current working directory into paths
+**relative to the config file location**. They are required in the latter form
+for the config file.
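
The path transformation described above can be pictured with a small Python
sketch. The function name `to_config_relative` and the example directories are
hypothetical, and leaving absolute paths untouched is an assumption; this only
shows the idea of re-expressing a path given relative to the current working
directory as one relative to the directory that holds the config file.

```python
import posixpath


def to_config_relative(value, cwd, config_dir):
    """Rewrite `value` (relative to `cwd`) as a path relative to `config_dir`.

    Assumption: absolute paths are returned unchanged.
    """
    if posixpath.isabs(value):
        return value
    absolute = posixpath.normpath(posixpath.join(cwd, value))
    return posixpath.relpath(absolute, config_dir)


# Hypothetical layout: the project lives in /home/user/project and its
# config file sits in the .dvc/ directory inside it.
print(to_config_relative("mycache", "/home/user/project/data",
                         "/home/user/project/.dvc"))
```

With these inputs, `mycache` typed from inside `data/` would be stored as
`../data/mycache`, since the config file is one level below the project root.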
## Options
diff --git a/static/docs/commands-reference/cache/index.md b/static/docs/commands-reference/cache/index.md
index b0bd01f166..46e030c10b 100644
--- a/static/docs/commands-reference/cache/index.md
+++ b/static/docs/commands-reference/cache/index.md
@@ -1,6 +1,6 @@
# cache
-Contains a helper command to set the cache directory location:
+Contains a helper command to set the cache directory location:
[dir](/doc/commands-reference/cache/dir).
## Synopsis
@@ -15,12 +15,12 @@ positional arguments:
## Description
-After DVC initialization, a hidden directory `.dvc/` is created with the
-[DVC internal files](/doc/user-guide/dvc-files-and-directories), including the
-default cache directory.
+After DVC initialization, a hidden directory `.dvc/` is created to contain the
+[DVC files and directories](/doc/user-guide/dvc-files-and-directories),
+including the default cache directory.
-The DVC cache is where your data files, models, etc (anything you want to
-version with DVC) are actually stored. The corresponding files you see in the
+The cache is where your data files, models, etc. (anything you want to version
+with DVC) are actually stored. The corresponding files you see in the
workspace simply link to the ones in cache. (See
`dvc config cache`, `type` config option, for more information on file links on
different platforms.)
diff --git a/static/docs/commands-reference/checkout.md b/static/docs/commands-reference/checkout.md
index 66480796ef..9224594934 100644
--- a/static/docs/commands-reference/checkout.md
+++ b/static/docs/commands-reference/checkout.md
@@ -16,14 +16,15 @@ positional arguments:
## Description
-[DVC-files](/doc/user-guide/dvc-file-format) in a DVC project
-specify which instance of each data file or directory is to be used, using the
-checksum saved in the `outs` fields. The `dvc checkout` command updates the
-workspace data to match with the cache files corresponding to those checksums.
+[DVC-files](/doc/user-guide/dvc-file-format) in a project specify
+which instance of each data file or directory is to be used, using the checksum
+saved in the `outs` fields. The `dvc checkout` command updates the workspace
+data to match with the cached files corresponding to those
+checksums.
Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM repository, the DVC-files will contain checksums for
-the corresponding data files kept in the DVC cache. After an SCM command like
+the corresponding data files kept in the cache. After an SCM command like
`git checkout` is run, the DVC-files will change to the state at the specified
branch or commit or tag. Afterwards, the `dvc checkout` command is required in
order to synchronize the data files with the currently checked out DVC-files.
@@ -64,8 +65,8 @@ restoring any file size will be almost instantaneous.
> `cache.slow_link_warning` config option to `false` with `dvc config cache`.
The output of `dvc checkout` does not list which data files were restored. It
-does report removed files and files that DVC was unable to restore due to it
-missing from the cache.
+does report removed files and files that DVC was unable to restore because
+they're missing from the cache.
This command will fail to checkout files that are missing from the cache. In
such a case, `dvc checkout` prints a warning message. Any files that can be
@@ -74,7 +75,7 @@ checked out without error will be restored.
There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
regenerate its outputs. (See also `dvc pipeline`.) In other cases the cache can
-be pulled from a remote cache using `dvc pull`.
+be pulled from remote storage using `dvc pull`.
## Options
@@ -90,10 +91,9 @@ be pulled from a remote cache using `dvc pull`.
inspect.
- `-f`, `--force` - does not prompt when removing workspace files. Changing the
- current set of DVC-files with SCM commands like `git checkout` can result in
- the need for DVC to remove files which should not exist in the current state
- and are missing in the local cache (they are not committed in DVC terms). This
- option controls whether the user will be asked to confirm these files removal.
+ current set of DVC-files with `git checkout` can result in the need for DVC to
+ remove files that don't match those DVC-file references or are missing from
+ cache. (They are not "committed", in DVC terms.)
- `-h`, `--help` - shows the help message and exits.
@@ -205,8 +205,8 @@ MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43
```
What happened is that DVC went through the sole existing DVC-file and adjusted
-the current set of files to match the `outs` of that stage. `dvc fetch` command
-runs once to download missing data from the remote storage to the local cache.
+the current set of files to match the `outs` of that stage. `dvc fetch` is run
+once to download missing data from the remote storage to the cache.
Alternatively, we could have just run `dvc pull` in this case to automatically
do `dvc fetch` + `dvc checkout`.
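
To make the relationship between these three commands concrete, here is a toy
Python model in which the remote, the cache and the workspace are plain
dictionaries. All function names and data structures are hypothetical and
greatly simplified; the sketch only shows that a pull behaves like a fetch
followed by a checkout.

```python
def fetch(remote, cache, checksums):
    """Copy objects missing from `cache` out of `remote`; return what was copied."""
    downloaded = [c for c in checksums if c not in cache and c in remote]
    for checksum in downloaded:
        cache[checksum] = remote[checksum]
    return downloaded


def checkout(cache, workspace, outs):
    """Place cached data in `workspace` for each {path: checksum} output.

    Returns the paths that could not be restored because they are not cached.
    """
    missing = []
    for path, checksum in outs.items():
        if checksum in cache:
            workspace[path] = cache[checksum]
        else:
            missing.append(path)
    return missing


def pull(remote, cache, workspace, outs):
    """Toy equivalent of running fetch and then checkout."""
    fetch(remote, cache, list(outs.values()))
    return checkout(cache, workspace, outs)
```

In this model, calling `checkout` alone with an empty cache reports every
output as missing, which mirrors the warning described earlier for files that
are not in the cache.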
diff --git a/static/docs/commands-reference/commit.md b/static/docs/commands-reference/commit.md
index ded23eae12..4d8d35d42b 100644
--- a/static/docs/commands-reference/commit.md
+++ b/static/docs/commands-reference/commit.md
@@ -1,7 +1,7 @@
# commit
Record changes to the repository by updating
-[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to
+[DVC-files](/doc/user-guide/dvc-file-format) and saving outputs to the
cache.
## Synopsis
@@ -20,9 +20,8 @@ positional arguments:
The `dvc commit` command is useful for several scenarios where a dataset is
being changed: when a [stage](/doc/commands-reference/run) or
[pipeline](/doc/commands-reference/pipeline) is in development, when one wishes
-to run commands outside the control of DVC, or to force
-[DVC-file](/doc/user-guide/dvc-file-format) updates to save time tying stages or
-a pipeline.
+to run commands outside the control of DVC, or to force DVC-file updates to save
+time tying stages or a pipeline.
- Code or data for a stage is under active development, with rapid iteration of
code, configuration, or data. Run DVC commands (`dvc run`, `dvc repro`, and
@@ -43,29 +42,29 @@ a pipeline.
stages. `dvc commit` can help avoid having to reproduce a pipeline in these
cases by forcing the update of the DVC-files.
-The last two use cases are **not recommended**, and essentially force update the
-DVC-files and save data to cache. They are still useful, but keep in mind that
-DVC can't guarantee reproducibility in those cases: you commit any data you
-want. Let's take a look at what is happening in the fist scenario closely:
+Let's take a closer look at what is happening in the first scenario. Normally
+DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
+cache after creating a DVC-file. What _commit_ means is that DVC:
-Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data
-to the DVC cache after creating a DVC-file. What _commit_ means is
-that DVC:
-
-- Computes a checksum for the file/directory
-- Enters the checksum and file name into the DVC-file
-- Tells the SCM to ignore the file/directory (e.g. add entry to `.gitignore`)
- (Note that if the workspace was initialized with no SCM support
+- Computes a checksum for the file/directory.
+- Enters the checksum and file name into the DVC-file.
+- Tells Git to ignore the file/directory (adding an entry to `.gitignore`).
+ (Note that if the project was initialized with no SCM support
(`dvc init --no-scm`), this does not happen.)
-- Adds the file/directory or to the DVC cache
+- Adds the file/directory to the cache.
There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents the last step
from occurring (on the commands where it's available), saving time and space by
not storing unwanted data artifacts. Checksums are still computed
-and added to the DVC-file, but the actual data file is not saved in the DVC
-cache. This is where the `dvc commit` command comes into play. It performs that
-last step: storing the file in the DVC cache.
+and added to the DVC-file, but the actual data file is not saved in the cache.
+This is where the `dvc commit` command comes into play. It performs that last
+step (saving the data in cache).
+
+The last two scenarios are **not recommended**. They essentially force-update
+the [DVC-files](/doc/user-guide/dvc-file-format) and save data to cache. They
+are still useful, but keep in mind that DVC can't guarantee reproducibility in
+those cases, where you commit any data you want.
## Options
@@ -131,11 +130,11 @@ $ dvc pull --all-branches --all-tags
Sometimes we want to iterate through multiple changes to configuration, code, or
data, trying multiple options to improve the output of a stage. To avoid filling
-the DVC cache with undesired intermediate results, we can run a
-single stage with `dvc run --no-commit`, or reproduce an entire pipeline using
+the cache with undesired intermediate results, we can run a single
+stage with `dvc run --no-commit`, or reproduce an entire pipeline using
`dvc repro --no-commit`. This prevents data from being saved to cache. When
development of the stage is finished, `dvc commit` can be used to store data
-files in the DVC cache.
+files in the cache.
In the `featurize.dvc` stage, `src/featurize.py` is executed. A useful change to
make is adjusting a parameter to `CountVectorizer` in that script. Namely,
@@ -149,7 +148,7 @@ bag_of_words = CountVectorizer(stop_words='english',
This option not only changes the trained model, it also introduces a change
which would cause the `featurize.dvc`, `train.dvc` and `evaluate.dvc` stages to
execute if we ran `dvc repro`. But if we want to try several values for this
-option and save only the best result to the DVC cache, we can execute as so:
+option and save only the best result to the cache, we can run it like this:
```dvc
$ dvc repro --no-commit evaluate.dvc
@@ -157,7 +156,7 @@ $ dvc repro --no-commit evaluate.dvc
We can run this command as many times as we like, editing `featurize.py` any way
we like, and so long as we use `--no-commit`, the data does not get saved to the
-DVC cache. But it is instructive to verify that's the case:
+cache. Let's verify that's the case:
First verification:
@@ -173,8 +172,8 @@ train.dvc:
not in cache: model.pkl
```
-And we can look in the DVC cache to see if the new version of `model.pkl` is
-indeed _not in cache_ as claimed. Look at `train.dvc` first:
+Now we can look in the cache directory to see if the new version of `model.pkl`
+is indeed _not in cache_ as claimed. Look at `train.dvc` first:
```yaml
cmd: python src/train.py data/features model.pkl
@@ -194,10 +193,10 @@ wdir: .
```
To verify this instance of `model.pkl` is not in the cache, we must know the
-names of the cache files. In the DVC cache the first two characters of the
-checksum are used as a directory name, and the file name is the remaining
-characters. Therefore, if the file had been committed to the cache it would
-appear in the directory `.dvc/cache/70`. But:
+path to the cached file. In the cache directory, the first two characters of the
+checksum are used as a subdirectory name, and the remaining characters are the
+file name. Therefore, had the file been committed to the cache, it would appear
+in the directory `.dvc/cache/70`. Let's check:
```dvc
$ ls .dvc/cache/70
@@ -215,8 +214,8 @@ $ ls .dvc/cache/70
599f166c2098d7ffca91a369a78b0d
```
-And we've verified that `dvc commit` has saved the changes into the cache, and
-that the new instance of `model.pkl` is in the cache.
+We've verified that `dvc commit` has saved the changes into the cache, and that
+the new instance of `model.pkl` is there.
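
The same check can be scripted. The sketch below builds the expected cache
path from a checksum using the layout described above (first two characters as
the subdirectory, the rest as the file name). The helper names `cache_path`
and `is_cached` are hypothetical, and the default location assumes the cache
has not been moved with `dvc cache dir`.

```python
import os
import posixpath


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Return the path where a file with this checksum would be cached."""
    return posixpath.join(cache_dir, checksum[:2], checksum[2:])


def is_cached(checksum, cache_dir=".dvc/cache"):
    """Report whether a cached file exists for this checksum."""
    return os.path.isfile(cache_path(checksum, cache_dir))


# Checksum assembled from the directory and file names shown above.
print(cache_path("70599f166c2098d7ffca91a369a78b0d"))
```

Run from the project root, `is_cached` on that checksum would be expected to
return `False` after `dvc repro --no-commit` and `True` once `dvc commit` has
been run.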
## Example: Running commands without DVC
diff --git a/static/docs/commands-reference/config.md b/static/docs/commands-reference/config.md
index c6b6fce4aa..d0b760af85 100644
--- a/static/docs/commands-reference/config.md
+++ b/static/docs/commands-reference/config.md
@@ -1,6 +1,6 @@
# config
-Get or set repository or global DVC config options.
+Get or set project-level (or global) DVC configuration options.
## Synopsis
@@ -51,7 +51,7 @@ corresponding config file.
## Configuration sections
These are the `name` parameters that can be used with `dvc config`, or the
-sections in the DVC project config file (`.dvc/config`).
+sections in the project config file (`.dvc/config`).
### core
@@ -83,10 +83,10 @@ remote. See `dvc remote` for more information.
### cache
-The DVC cache is a hidden storage (by default located in the `.dvc/cache`
-directory) for files that are under DVC control, and their different versions.
-(See `dvc cache` and
-[DVC internal files](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+A DVC project cache is the hidden storage (by default located in
+the `.dvc/cache` directory) for files that are under DVC control, and their
+different versions. (See `dvc cache` and
+[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
for more details.)
- `cache.dir` - set/unset cache directory location. A correct value must be
@@ -137,9 +137,9 @@ for more details.)
> These warnings are automatically turned off when `cache.type` is manually
> set.
-- `cache.local` - name of a local remote to use as local cache. This will
- overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`.
- Refer to `dvc remote` for more information on "local remotes".
+- `cache.local` - name of a local remote to use as cache directory. (Refer to
+ `dvc remote` for more information on "local remotes".) This will overwrite the
+ value provided to `dvc config cache.dir` or `dvc cache dir`.
- `cache.ssh` - name of an
[SSH remote to use as external cache](/doc/user-guide/external-outputs#ssh).
diff --git a/static/docs/commands-reference/destroy.md b/static/docs/commands-reference/destroy.md
index bd35dbe1fe..686266986f 100644
--- a/static/docs/commands-reference/destroy.md
+++ b/static/docs/commands-reference/destroy.md
@@ -1,8 +1,8 @@
# destroy
Remove all
-[DVC files and directories](/doc/user-guide/dvc-files-and-directories) from the
-project.
+[DVC files and directories](/doc/user-guide/dvc-files-and-directories) from a
+DVC project.
## Synopsis
@@ -13,16 +13,17 @@ usage: dvc destroy [-h] [-q | -v] [-f]
## Description
`dvc destroy` removes DVC-files, and the entire `.dvc/` meta directory from the
-workspace. Note that the DVC cache will normally be
-removed as well, unless it's set to an external location with `dvc cache dir`.
-(By default a local cache is located in the `.dvc/cache` directory.) If you were
-using [symlinks for linking data](/doc/user-guide/large-dataset-optimization)
-from the cache, DVC will replace them with copies, so that your data is intact
-after the DVC repository destruction.
+workspace. Note that the cache directory will normally
+be removed as well, unless it's set to an external location with
+`dvc cache dir`. (By default a local cache is located in the `.dvc/cache`
+directory.) If you were using
+[symlinks for linking data](/doc/user-guide/large-dataset-optimization) from the
+cache, DVC will replace them with copies, so that your data is intact after the
+DVC repository destruction.
## Options
-- `-f`, `--force` - do not prompt when destroying DVC project.
+- `-f`, `--force` - do not prompt when destroying this project.
- `-h`, `--help` - prints the usage/help message, and exits.
@@ -42,8 +43,7 @@ $ ls -a
.dvc .git code.py foo foo.dvc
$ dvc destroy
-
-This will destroy all information about your pipelines, all data files, as well as cache in .dvc/cache.
+This will destroy all information about your pipelines, all data files...
Are you sure you want to continue?
yes
@@ -64,12 +64,11 @@ $ dvc cache dir /mnt/cache
$ dvc add foo
```
-`dvc cache dir` changed the location of cache storage to external location.
-Content of DVC repository:
+`dvc cache dir` changed the location of the cache directory to an external
+location. Content of workspace:
```dvc
$ ls -a
-
.dvc .git code.py foo foo.dvc
```
@@ -87,7 +86,7 @@ Let's execute `dvc destroy`:
```dvc
$ dvc destroy
-This will destroy all information about your pipelines, all data files, as well as cache in .dvc/cache.
+This will destroy all information about your pipelines, all data files...
Are you sure you want to continue? [y/n]
yes
diff --git a/static/docs/commands-reference/diff.md b/static/docs/commands-reference/diff.md
index faac084c0d..f0bee72ef4 100644
--- a/static/docs/commands-reference/diff.md
+++ b/static/docs/commands-reference/diff.md
@@ -1,10 +1,10 @@
# diff
-Show changes between versions of the DVC repository. It can be narrowed down to
-specific target files and directories under DVC control.
+Show changes between versions of the DVC project. It can be
+narrowed down to specific target files and directories under DVC control.
-> This command requires the repository to be versioned with
-> [Git](https://git-scm.com/).
+> This command requires that the project is a [Git](https://git-scm.com/)
+> repository.
## Synopsis
diff --git a/static/docs/commands-reference/fetch.md b/static/docs/commands-reference/fetch.md
index f61b4f0d6f..3e17ae0f3a 100644
--- a/static/docs/commands-reference/fetch.md
+++ b/static/docs/commands-reference/fetch.md
@@ -1,8 +1,8 @@
# fetch
Get files that are under DVC control from
-[remote](/doc/commands-reference/remote#description) storage into the local
-cache.
+[remote](/doc/commands-reference/remote#description) storage into the
+cache.
## Synopsis
@@ -19,11 +19,12 @@ positional arguments:
## Description
The `dvc fetch` command is a means to download files from remote storage into
-the local cache, but without placing them in the workspace. This
-makes the data files available for linking (or copying) into the workspace.
-(Refer to [dvc config cache.type](/doc/commands-reference/config#cache).) Along
-with `dvc checkout`, it's performed automatically by `dvc pull` when the target
-[DVC-files](/doc/user-guide/dvc-file-format) are not already in the local cache:
+the cache of the project, but without placing them in the
+workspace. This makes the data files available for linking (or
+copying) into the workspace. (Refer to
+[dvc config cache.type](/doc/commands-reference/config#cache).) Along with
+`dvc checkout`, it's performed automatically by `dvc pull` when the target
+[DVC-files](/doc/user-guide/dvc-file-format) are not already in the cache:
```
Controlled files Commands
@@ -34,7 +35,7 @@ remote storage
| +------------+
| - - - - | dvc fetch | ++
v +------------+ + +----------+
-local cache ++ | dvc pull |
+project's cache ++ | dvc pull |
+ +------------+ + +----------+
| - - - - |dvc checkout| ++
| +------------+
@@ -42,22 +43,21 @@ local cache ++ | dvc pull |
workspace
```
-Fetching could be useful when first checking out an existing DVC
-project, since files under DVC control could already exist in remote
-storage, but won't be in your local cache. (Refer to `dvc remote` for more
-information on DVC remotes.) These necessary data or model files are listed as
-dependencies or outputs in a DVC-file (target
-[stage](/doc/commands-reference/run)) so they are required to
-[reproduce](/doc/get-started/reproduce) the corresponding
+Fetching could be useful when first checking out a DVC project,
+since files under DVC control should already exist in remote storage, but won't
+be in the project's cache. (Refer to `dvc remote` for more information on DVC
+remotes.) These necessary data or model files are listed as dependencies or
+outputs in a DVC-file (target [stage](/doc/commands-reference/run)) so they are
+required to [reproduce](/doc/get-started/reproduce) the corresponding
[pipeline](/doc/commands-reference/pipeline). (See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more information on
dependencies and outputs.)
`dvc fetch` ensures that the files needed for a DVC-file to be
-[reproduced](/doc/get-started/reproduce) exist in the local cache. If no
-`targets` are specified, the set of data files to fetch is determined by
-analyzing all DVC-files in the current branch, unless `--all-branches` or
-`--all-tags` is specified.
+[reproduced](/doc/get-started/reproduce) exist in cache. If no `targets` are
+specified, the set of data files to fetch is determined by analyzing all
+DVC-files in the current branch, unless `--all-branches` or `--all-tags` is
+specified.
The default remote is used unless `--remote` is specified. See `dvc remote add`
for more information on how to configure different remote storage providers.
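The relationship between the three commands in the diagram above can be sketched as a toy model. This is only an illustration, not DVC's implementation: plain dictionaries stand in for remote storage, the cache, and the workspace, and the shortened checksums and byte strings are made-up placeholders.

```python
# Toy model of the data flow; all names and values here are hypothetical.

def fetch(remote, cache, wanted):
    """Copy the wanted checksums from remote storage into the cache."""
    for checksum in wanted:
        if checksum not in cache:  # data already in the cache is skipped
            cache[checksum] = remote[checksum]


def checkout(cache, workspace, dvc_files):
    """Place cached data in the workspace at the paths DVC-files refer to."""
    for path, checksum in dvc_files.items():
        workspace[path] = cache[checksum]


def pull(remote, cache, workspace, dvc_files):
    """Fetch followed by checkout, in a single step."""
    fetch(remote, cache, dvc_files.values())
    checkout(cache, workspace, dvc_files)


remote = {"3863d0": b"model bytes", "42c702": b"feature bytes"}
cache, workspace = {}, {}
dvc_files = {"model.pkl": "3863d0", "data/features": "42c702"}

pull(remote, cache, workspace, dvc_files)
print(sorted(cache))
print(sorted(workspace))
```

In this model, running `fetch` alone fills `cache` but leaves `workspace` empty, which mirrors why `dvc checkout` is still needed after `dvc fetch`.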
@@ -163,8 +163,8 @@ bigrams-experiment <- use bigrams to improve the model
This project comes with a predefined HTTP
[remote storage](/doc/commands-reference/remote). We can now just run
-`dvc fetch` that will download the most recent `model.pkl`, `data.xml`, and
-other files that are under DVC control into our local cache:
+`dvc fetch` to download the most recent `model.pkl`, `data.xml`, and other files
+that are under DVC control into our local cache.
```dvc
$ dvc status --cloud
@@ -191,14 +191,15 @@ $ tree .dvc
βββ ...
```
-> `dvc status --cloud` (or `-c`) compares local cache vs default remote.
+> `dvc status --cloud` (or `-c`) compares the cache contents vs. the default
+> remote.
As seen above, used without arguments, `dvc fetch` downloads all assets needed
by all DVC-files in the current branch, including for directories. The checksums
`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4`
correspond to the `model.pkl` file and `data/features/` directory, respectively.
-Let's link files from local cache to the workspace with:
+Let's now link files from the cache to the workspace with:
```dvc
$ dvc checkout
@@ -242,8 +243,7 @@ checksums shown above.
After following the previous example (**Specific stages**), only the files
associated with the `prepare.dvc` stage file have been fetched. Several
-dependencies/outputs of other pipeline stages are still missing from local
-cache:
+dependencies/outputs of other pipeline stages are still missing from the cache:
```dvc
$ dvc status -c
@@ -287,14 +287,15 @@ $ tree .dvc/cache
βββ a9c512fda11293cfee7617b66648dc
```
-Fetching using `--with-deps` starts with the target DVC-file (stage) and
-searches backwards through its pipeline for data files to download into the
-local cache. All the data for the second and third stages ("featurize" and
-"train") has now been downloaded to cache. We could now use `dvc checkout` to
-get the data files needed to reproduce this pipeline up to the third stage into
-the workspace (with `dvc repro train.dvc`).
+Fetching using `--with-deps` starts with the target
+[DVC-file](/doc/user-guide/dvc-file-format) (`train.dvc` stage) and searches
+backwards through its pipeline for data to download into the project's cache.
+All the data for the second and third stages ("featurize" and "train") has now
+been downloaded to the cache. We could now use `dvc checkout` to get the data
+files needed to reproduce this pipeline up to the third stage into the workspace
+(with `dvc repro train.dvc`).
> Note that in this sample project, the last stage file `evaluate.dvc` doesn't
-> add any more data files than those form previous stages so at this point all
-> of the files for this pipeline are in local cache and `dvc status -c` would
-> output `Pipelines are up to date.`
+> add any more data files than those from previous stages. So at this point
+> (after reproducing `train.dvc`) all of the data for this pipeline is cached,
+> and `dvc status -c` would output `Data and pipelines are up to date.`
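The backward search that `--with-deps` performs can be pictured with a small graph walk. This is a minimal sketch under assumptions: the `DEPENDS_ON` mapping is a hypothetical stand-in for the sample project's stage files, and the traversal is a generic one, not necessarily how DVC implements it.

```python
# Hypothetical stage graph: each stage file maps to the stage files it needs.
DEPENDS_ON = {
    "prepare.dvc": [],
    "featurize.dvc": ["prepare.dvc"],
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc"],
}


def with_deps(target, depends_on):
    """Return the target stage plus every stage it depends on, target first."""
    seen, stack = [], [target]
    while stack:
        stage = stack.pop()
        if stage in seen:
            continue
        seen.append(stage)
        stack.extend(depends_on[stage])  # walk backwards through the pipeline
    return seen


stages = with_deps("train.dvc", DEPENDS_ON)
print(stages)  # ['train.dvc', 'featurize.dvc', 'prepare.dvc']
```

Note that `evaluate.dvc` is never visited, because the walk only moves from the target towards earlier stages.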
diff --git a/static/docs/commands-reference/get.md b/static/docs/commands-reference/get.md
index d3d652ed7b..8658bb98a5 100644
--- a/static/docs/commands-reference/get.md
+++ b/static/docs/commands-reference/get.md
@@ -1,7 +1,7 @@
# get
-Download or copy file or directory from another DVC repository (on a git server
-such as Github) into the local file system.
+Download or copy a file or directory from another DVC repository (on a Git
+server such as GitHub) into the local file system.
> Unlike `dvc import`, this command does not track the downloaded data files
> (does not create a DVC-file).
@@ -23,9 +23,10 @@ other files and directories tracked in another DVC repository into the current
working directory, regardless of whether it's a DVC project. The `dvc get`
command downloads such a data artifact.
-The `url` argument specifies the external DVC project's Git repository URL (both
-HTTP and SSH protocols supported, e.g. `[user@]server:project.git`), while
-`path` is used to specify the path to the data to be downloaded within the repo.
+The `url` argument specifies the address of the Git repository containing the
+external DVC project (both HTTP and SSH protocols supported, e.g.
+`[user@]server:project.git`). `path` is used to specify the path of the data to
+be downloaded within the repo.
Note that this command doesn't require an existing DVC project to run in. It's a
single-purpose command that can be used out of the box after installing DVC.
@@ -42,7 +43,8 @@ created in the current working directory, with its original file name.
isn't used) is the current working directory (`.`) and original file name.
- `--rev` - specific Git revision of the DVC repository to import the data from.
- `HEAD` by default.
+ The tip of the default branch is used when this option is not
+ specified.
- `-h`, `--help` - prints the usage/help message, and exit.
@@ -79,12 +81,12 @@ is found, which specifies `model.pkl` in its outputs (`outs`). DVC then
its
[config file](https://github.com/iterative/example-get-started/blob/master/.dvc/config)).
-A common use for downloading binary files from DVC repos, as done in this
-example, is to place a ML model inside a wrapper application that serves as an
-[ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) pipeline or as an
-HTTP/RESTful API (web service) that provides predictions upon request. This can
-be automated leveraging DVC with [CI/CD](https://en.wikipedia.org/wiki/CI/CD)
-tools.
+A recommended use for downloading binary files from DVC repositories, as done in
+this example, is to place a ML model inside a wrapper application that serves as
+an [ETL](https://en.wikipedia.org/wiki/Extract,_transform,_load) pipeline or as
+an HTTP/RESTful API (web service) that provides predictions upon request. This
+can be automated leveraging DVC with
+[CI/CD](https://en.wikipedia.org/wiki/CI/CD) tools.
The same example applies to raw or intermediate data files as well, of course,
for cases where we want to download those files and perform some analysis on
diff --git a/static/docs/commands-reference/import-url.md b/static/docs/commands-reference/import-url.md
index 81d5fab644..a1ecd651f3 100644
--- a/static/docs/commands-reference/import-url.md
+++ b/static/docs/commands-reference/import-url.md
@@ -41,8 +41,8 @@ DVC supports [DVC-files](/doc/user-guide/dvc-file-format) which refer to data in
an external location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such a
DVC-file, the `deps` section stores the remote URL, and the `outs` section
-contains the corresponding local path in the workspace. It records enough data
-from the external file or directory to enable DVC to efficiently check it to
+contains the corresponding local path in the workspace. It records metadata from
+the external file or directory, allowing DVC to efficiently check it later and
determine whether the local copy is out of date.
DVC supports several types of (local or) remote locations (protocols):
@@ -184,8 +184,8 @@ outs:
The `etag` field in the DVC-file contains the
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag) recorded from the HTTP request.
-If the remote file changes, its ETag will be different, letting DVC know whether
-its necessary to download it again.
+If the remote file changes, its ETag will be different. This metadata allows DVC
+to determine whether it's necessary to download it again.
> See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details on the
> text format above.
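The ETag check described above amounts to a simple comparison. The following is a minimal sketch, assuming a trimmed and hypothetical view of a DVC-file; the URL and ETag values are placeholders, and no network request is made.

```python
# Hypothetical, trimmed view of the fields an import stage might record.
dvc_file = {
    "deps": [{"path": "https://example.com/data.xml", "etag": '"f432e270"'}],
    "outs": [{"path": "data.xml"}],
}


def needs_download(recorded_dep, current_etag):
    """Return True when the remote ETag no longer matches the recorded one."""
    return recorded_dep["etag"] != current_etag


dep = dvc_file["deps"][0]
print(needs_download(dep, '"f432e270"'))  # False: remote file is unchanged
print(needs_download(dep, '"0a1b2c3d"'))  # True: remote file was replaced
```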
@@ -326,7 +326,8 @@ Saving information to 'data.xml.dvc'.
DVC has noticed the "external" data source has changed, and updated the import
stage (reproduced it). In this case it's also necessary to run `dvc repro` so
-that the rest of the pipeline is also run again. We can confirm so with:
+that the rest of the pipeline results are also regenerated. We can confirm so
+with:
```dvc
$ dvc status
@@ -348,6 +349,6 @@ $ dvc status
Data and pipelines are up to date.
```
-`dvc repro` runs again the given stage `prepare.dvc`, noticing that its
-dependency `data/data.xml` has changed. `dvc status` should report "Nothing to
-reproduce." after this.
+`dvc repro` executes the command defined in the given `prepare.dvc` stage after
+noticing that its dependency `data/data.xml` has changed. `dvc status` should
+report "Nothing to reproduce." after this.
diff --git a/static/docs/commands-reference/import.md b/static/docs/commands-reference/import.md
index 86e99838a2..ce3f1fd0a7 100644
--- a/static/docs/commands-reference/import.md
+++ b/static/docs/commands-reference/import.md
@@ -28,10 +28,10 @@ workspace. The `dvc import` command downloads such a data artifact
in a way that it is tracked with DVC, so it can be updated when the external
data source changes.
-The `url` argument specifies the Git repository URL of the external DVC
-project (both HTTP and SSH protocols are supported, e.g.
-`[user@]server:project.git`), while `path` is used to specify the path to the
-data to be downloaded within the repo.
+The `url` argument specifies the address of the Git repository containing the
+external DVC project (both HTTP and SSH protocols supported, e.g.
+`[user@]server:project.git`). `path` is used to specify the path of the data to
+be downloaded within the repo.
> See `dvc import-url` to download and track data from other supported URLs.
@@ -53,7 +53,7 @@ To actually [track the data](https://dvc.org/doc/get-started/add-files),
Note that import stages are considered always "locked", meaning that if you run
`dvc repro`, they won't be updated. Use `dvc update` on them to update the
-downloaded data artifact from the external DVC repo.
+downloaded data artifact from the external DVC repository.
## Options
@@ -62,7 +62,8 @@ downloaded data artifact from the external DVC repo.
isn't used) is the current working directory (`.`) and original file name.
- `--rev` - specific Git revision of the DVC repository to import the data from.
- `HEAD` by default.
+ The tip of the default branch is used when this option is not
+ specified.
- `-h`, `--help` - prints the usage/help message, and exit.
@@ -73,8 +74,8 @@ downloaded data artifact from the external DVC repo.
## Examples
-An obvious case for this command is to import a dataset from an external DVC
-repo, such as our
+A simple case for this command is to import a dataset from an external DVC repo,
+such as our
[get started example repo](https://github.com/iterative/example-get-started).
```dvc
diff --git a/static/docs/commands-reference/index.md b/static/docs/commands-reference/index.md
index 3ae493b0cb..2c3ea0afef 100644
--- a/static/docs/commands-reference/index.md
+++ b/static/docs/commands-reference/index.md
@@ -2,16 +2,16 @@
DVC is a command-line tool. The typical use case for DVC goes as follows:
-- In an existing Git repository, initialize a DVC repository with `dvc init`.
-- Copy source code files for modeling into the repository and convert the files
- into DVC data files with `dvc add` command.
-- Process raw data files through your data processing and modeling code using
- the `dvc run` command.
-- Use `--outs` option to specify `dvc run` command outputs which will be
- converted to DVC data files after the code runs.
-- Clone a git repo with the code of your ML application pipeline. However, this
- will not copy your DVC cache. Use
- [data remotes](/doc/commands-reference/remote) and `dvc push` to share the
- cache (data).
-- Use `dvc repro` to quickly reproduce your pipeline on a new iteration, after
- your data item files or source code of your ML application are modified.
+- In an existing Git repository, initialize a DVC project with
+ `dvc init`.
+- Copy source code files for modeling into the repository and track the files
+ with DVC using the `dvc add` command.
+- Process raw data with your own data processing and modeling code using the
+ `dvc run` command, with the `--outs` option to specify outputs which will also
+ be tracked by DVC after the code is executed.
+- Sharing a Git repository with the source code of your ML
+ [pipeline](/doc/commands-reference/pipeline) will not include the project's
+ cache. Use [remote storage](/doc/commands-reference/remote) and
+ `dvc push` to share this cache (data tracked by DVC).
+- Use `dvc repro` to automatically reproduce your full pipeline, iteratively as
+ input data or source code change.
diff --git a/static/docs/commands-reference/init.md b/static/docs/commands-reference/init.md
index 8b6cbd0beb..bbbf5e6a17 100644
--- a/static/docs/commands-reference/init.md
+++ b/static/docs/commands-reference/init.md
@@ -1,6 +1,6 @@
# init
-This command initializes a DVC project on a directory.
+This command initializes a DVC project on a directory.
Note that by default the current working directory is expected to contain a Git
repository, unless the `--no-scm` option is used.
@@ -14,7 +14,7 @@ usage: dvc init [-h] [-q | -v] [--no-scm] [-f]
## Description
After DVC initialization, a new directory `.dvc/` will be created with `config`
-and `.gitignore` files, and cache directory. These files and
+and `.gitignore` files, and cache directory. These files and
directories are hidden from the user generally and are not meant to be
manipulated directly.
@@ -22,7 +22,7 @@ manipulated directly.
[DVC directories](/doc/user-guide/dvc-files-and-directories). It will hold all
the contents of tracked data files. Note that `.dvc/.gitignore` lists this
directory, which means that the cache directory is not under Git control. This
-is your local cache and you cannot push it to any Git remote.
+is a local cache and you cannot `git push` it.
## Options
@@ -30,8 +30,8 @@ is your local cache and you cannot push it to any Git remote.
written.
- `-f`, `--force` - remove `.dvc/` if it exists before initialization. Will
- remove all local cache. Useful when first `dvc init` got corrupted for some
- reason.
+ remove any existing local cache. Useful when a previous `dvc init` has been
+ corrupted.
- `-h`, `--help` - prints the usage/help message, and exit.
@@ -42,7 +42,7 @@ is your local cache and you cannot push it to any Git remote.
## Examples
-Creating a new DVC repository (requires a Git repository).
+Create a new DVC repository (requires Git):
```dvc
$ mkdir example && cd example
diff --git a/static/docs/commands-reference/install.md b/static/docs/commands-reference/install.md
index 79d268e4d3..b64fa9e434 100644
--- a/static/docs/commands-reference/install.md
+++ b/static/docs/commands-reference/install.md
@@ -1,6 +1,6 @@
# install
-Install DVC hooks into the Git repository to automate certain common actions.
+Install Git hooks into the DVC repository to automate certain common actions.
## Synopsis
@@ -17,30 +17,28 @@ automatically.
Namely:
-**Checkout**: For any given branch or tag, Git checks out the
+**Checkout**: For any given branch or tag, `git checkout` retrieves the
[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The
-DVC-files in turn refer to data files in the DVC cache by checksum.
-When switching from one SCM branch or tag to another, the SCM retrieves the
-corresponding DVC-files. By default that leaves the project in a
-state where the DVC-files refer to data files other than what is currently in
-the workspace. The user at this point should run `dvc checkout` so
-that the data files will match the current DVC-files.
+project's DVC-files in turn refer to data stored in
+cache, but not necessarily in the workspace. Normally,
+it would be necessary to run `dvc checkout` to synchronize workspace and
+DVC-files.
The installed Git hook automates running `dvc checkout`.
**Commit**: When committing a change to the Git repository, that change possibly
requires reproducing the corresponding
-[pipeline](/doc/commands-reference/pipeline) (with `dvc repro`) to regenerate
-the project results. Or there might be files not yet in the cache, which is a
-reminder to run `dvc commit`.
+[pipeline](/doc/commands-reference/pipeline) (using `dvc repro`) to regenerate
+the project results. Or there might be new data not yet in cache, which requires
+running `dvc commit` to update.
The installed Git hook automates reminding the user to run either `dvc repro` or
-`dvc commit`.
+`dvc commit`, as needed.
**Push**: While publishing changes to the Git remote repository with `git push`,
-it easy to forget that `dvc push` command usually needs to be run to save
-corresponding changes in data files and directories that are under DVC control
-to the DVC remote storage.
+it's easy to forget that the `dvc push` command is necessary to upload new or
+updated data files and directories under DVC control to
+[remote storage](/doc/commands-reference/remote).
The installed Git hook automates executing `dvc push`.
@@ -51,7 +49,7 @@ The installed Git hook automates executing `dvc push`.
- A `post-checkout` hook executes `dvc checkout` after `git checkout` to
automatically synchronize the data files with the new workspace state.
- A `pre-push` hook executes `dvc push` before `git push` to upload files and
- directories under DVC control to remote.
+ directories under DVC control to remote storage.
For more information about git hooks, refer to the
[git-scm documentation](https://git-scm.com/docs/githooks).
@@ -285,6 +283,6 @@ Data and pipelines are up to date.
After reproducing this pipeline up to the "evaluate" stage, the data files are
in sync with the code/config files, but we must now commit the changes to the
-Git repository. Looking closely we see that `dvc status` is run again, informing
-us that the data files are synchronized with the `Pipelines are up to date.`
-message.
+Git repository. Looking closely we see that `dvc status` is used again,
+informing us that the data files are synchronized with the
+`Data and pipelines are up to date.` message.
diff --git a/static/docs/commands-reference/lock.md b/static/docs/commands-reference/lock.md
index efa457421b..084d3579ab 100644
--- a/static/docs/commands-reference/lock.md
+++ b/static/docs/commands-reference/lock.md
@@ -4,8 +4,8 @@ Lock a [DVC-file](/doc/user-guide/dvc-file-format)
([stage](/doc/commands-reference/run)). Use `dvc unlock` to unlock the file.
If a DVC-file is locked, the stage is considered unchanged. `dvc repro` will not
-run commands to rebuild outputs of locked stages, even if some dependencies have
-changed and even if `--force` is provided.
+execute commands to regenerate outputs of locked stages, even if some
+dependencies have changed and even if `--force` is provided.
## Synopsis
diff --git a/static/docs/commands-reference/metrics/index.md b/static/docs/commands-reference/metrics/index.md
index 6089498212..a9327bdec1 100644
--- a/static/docs/commands-reference/metrics/index.md
+++ b/static/docs/commands-reference/metrics/index.md
@@ -31,7 +31,7 @@ way to compare and pick the best performing experiment variant.
[show](/doc/commands-reference/metrics/show),
[modify](/doc/commands-reference/metrics/modify), and
[remove](/doc/commands-reference/metrics/remove) commands are available to set
-up and manage DVC metrics.
+up and manage DVC project metrics.
## Options
@@ -56,7 +56,7 @@ $ dvc run -d code/evaluate.py -M data/eval.json \
> running `dvc metrics add data/eval.json` to explicitly mark `data/eval.json`
> as a metric file.
-Now let's print metric values that we are tracking in this DVC project:
+Now let's print metric values that we are tracking in this project:
```dvc
$ dvc metrics show -a
@@ -65,8 +65,8 @@ $ dvc metrics show -a
data/eval.json: {"AUC": "0.624652"}
```
-Then we can tell DVC an `xpath` for the metric file, so that it can output only
-the value of AUC. In the case of JSON, it uses
+We can also tell DVC an `xpath` for the metric file, so that it can output only
+the value of AUC. In the case of JSON, use
[JSONPath expressions](https://goessner.net/articles/JsonPath/index.html) to
selectively extract data out of metric files:
@@ -78,7 +78,7 @@ $ dvc metrics show
data/eval.json: 0.624652
```
-And finally let's remove `data/eval.json` from project's metrics:
+And finally let's remove `data/eval.json` from the project's metrics:
```dvc
$ dvc metrics remove data/eval.json
diff --git a/static/docs/commands-reference/metrics/show.md b/static/docs/commands-reference/metrics/show.md
index 824b005e14..3f3ff4f9d0 100644
--- a/static/docs/commands-reference/metrics/show.md
+++ b/static/docs/commands-reference/metrics/show.md
@@ -19,9 +19,9 @@ It will find and print all metric files (default) or a specified metric file in
the current branch (if `targets` are provided) or across all branches/tags (if
`-a` or `-T` specified respectively).
-The optional `targets` argument represents several DVC metric files or
-directories. If a `target` is a directory, recursively search and process all
-metric files in it with the `-R` option.
+The optional `targets` argument represents several metric files or directories.
+If a `target` is a directory, recursively search and process all metric files in
+it with the `-R` option.
Providing `type` (via `-t` CLI option), overrides the full metric specification
(both, `type` and `xpath`) defined in the DVC-file (usually, using
diff --git a/static/docs/commands-reference/move.md b/static/docs/commands-reference/move.md
index dc9775c612..5196c05fbe 100644
--- a/static/docs/commands-reference/move.md
+++ b/static/docs/commands-reference/move.md
@@ -18,10 +18,10 @@ positional arguments:
## Description
`dvc move` is useful when a `src` file or directory has previously been added to
-DVC with `dvc add`, creating a [DVC-file](/doc/user-guide/dvc-file-format) (with
-`src` as a dependency). `dvc move` behaves like `mv src dst`, moving `src` to
-the given `dst` path, but it also renames and updates the corresponding DVC-file
-appropriately.
+the project with `dvc add`, creating a
+[DVC-file](/doc/user-guide/dvc-file-format) (with `src` as an output).
+`dvc move` behaves like `mv src dst`, moving `src` to the given `dst` path, but
+it also renames and updates the corresponding DVC-file appropriately.
> Note that `src` may be a copy or a
> [link](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
diff --git a/static/docs/commands-reference/pipeline/index.md b/static/docs/commands-reference/pipeline/index.md
index 3eb03b8294..40c5bb6175 100644
--- a/static/docs/commands-reference/pipeline/index.md
+++ b/static/docs/commands-reference/pipeline/index.md
@@ -17,16 +17,19 @@ positional arguments:
## Description
-A data pipeline, in general, is a chain of commands that process data files. It
-produces intermediate data and a final result. For example, Machine Learning
-(ML) pipelines typically start a with large raw datasets, include featurization
-and training intermediate stages, and produce a final model, as well as certain
-metrics.
-
-In DVC, pipeline stage files and commands, their data I/O, interdependencies,
-and results (intermediate or final) are defined with `dvc add` and `dvc run`,
-among other commands. This allows us to form one or more pipelines of stages
-connected by their dependencies and outputs.
+A data pipeline, in general, is a series of data processes (for example console
+commands that take an input and produce an output). A pipeline may produce
+intermediate data, and has a final result. Machine Learning (ML) pipelines
+typically start with large raw datasets, include intermediate featurization
+and training stages, and produce a final model, as well as accuracy metrics.
+
+In DVC, pipeline stages and commands, their data I/O, interdependencies, and
+results (intermediate or final) are defined with `dvc add` and `dvc run`, among
+other commands. This allows DVC to restore one or more pipelines of stages
+interconnected by their dependencies and outputs later. (See `dvc repro`.)
+
+> DVC builds a dependency graph
+> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) to do this.
`dvc pipeline` commands help users display the existing project pipelines in
different ways.
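To make the dependency graph idea concrete, here is a minimal sketch using Python's standard `graphlib` module. The stage names and edges are hypothetical, and this is not a claim about how DVC builds or orders its graph internally; it only shows how a DAG yields a valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: each stage maps to the stages whose outputs it uses.
graph = {
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}

# static_order() yields every stage after all of the stages it depends on.
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['prepare', 'featurize', 'train', 'evaluate']
```

If the graph contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError`, which is why stage dependencies need to form an acyclic graph.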
diff --git a/static/docs/commands-reference/pipeline/show.md b/static/docs/commands-reference/pipeline/show.md
index 71ec7396ba..8c68007f68 100644
--- a/static/docs/commands-reference/pipeline/show.md
+++ b/static/docs/commands-reference/pipeline/show.md
@@ -118,10 +118,10 @@ $ dvc pipeline show eval.txt.dvc --ascii
`--------------'
```
-List dependencies recursively if graph have tree structure:
+List dependencies recursively if the graph has a tree structure:
```dvc
-dvc pipeline show e.file.dvc --tree
+$ dvc pipeline show e.file.dvc --tree
e.file.dvc
├── c.file.dvc
│   └── b.file.dvc
diff --git a/static/docs/commands-reference/pull.md b/static/docs/commands-reference/pull.md
index 8dfd7bdc27..3ea32a1a16 100644
--- a/static/docs/commands-reference/pull.md
+++ b/static/docs/commands-reference/pull.md
@@ -1,9 +1,9 @@
# pull
Downloads missing files and directories from
-[remote storage](/doc/commands-reference/remote) to the local cache
-based on [DVC-files](/doc/user-guide/dvc-file-format) in the
-workspace, then links the downloaded files into the workspace.
+[remote storage](/doc/commands-reference/remote) to the cache based
+on [DVC-files](/doc/user-guide/dvc-file-format) in the workspace,
+then links the downloaded files into the workspace.
## Synopsis
@@ -43,9 +43,9 @@ only the files (or directories) missing from the workspace by searching all
versions or branches of the repository if using Git, nor will it download files
which have not changed.
-The command `dvc status -c` can list files that are missing in the local cache
-but referenced in the current project DVC-files. It can be used to see what
-files `dvc pull` would download.
+The command `dvc status -c` can list files referenced in current DVC-files, but
+missing in the cache. It can be used to see what files `dvc pull`
+would download.
If one or more `targets` are specified, DVC only considers the files associated
with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies
@@ -59,12 +59,10 @@ reflinks or hardlinks to put it in the workspace without copying. See
## Options
-- `-r REMOTE`, `--remote REMOTE` specifies which remote cache (see
- `dvc remote list`) to pull from. The value for `REMOTE` is a cache name
- defined using the `dvc remote` command. If no `REMOTE` is given, or if no
- remote's are defined in the project, an error message is printed. If the
- option is not specified, then the default remote, configured with the
- `core.config` config option, is used.
+- `-r REMOTE`, `--remote REMOTE` specifies which remote to pull from (see
+ `dvc remote list`). The value for `REMOTE` is a name defined using
+ `dvc remote`. If the option is not specified, then the default remote
+ (configured with the `core.remote` config option) is used.
- `-a`, `--all-branches` - determines the files to download by examining
DVC-files in all branches of the project repository (if using Git). It's
@@ -84,15 +82,15 @@ reflinks or hardlinks to put it in the workspace without copying. See
each target directory and its subdirectories for DVC-files to inspect.
- `-f`, `--force` - does not prompt when removing workspace files, which occurs
- when these file no longer match the DVC-file references. This option surfaces
- behavior from the `dvc fetch` and `dvc checkout` commands because `dvc pull`
- in effect performs those 2 functions in a single command.
+ when these files no longer match the current DVC-file references. This option
+ surfaces behavior from the `dvc fetch` and `dvc checkout` commands because
+ `dvc pull` in effect performs those 2 functions in a single command.
- `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously
- while downloading files from the remote cache. The effect is to control the
- number of files downloaded simultaneously. Default is `4 * cpu_count()`. For
- example with `-j 1` DVC downloads one file at a time, with `-j 2` it downloads
- two at a time, and so forth. For SSH remotes default is set to 4.
+ while downloading files from the remote. The effect is to control the number
+ of files downloaded simultaneously. Default is `4 * cpu_count()`. For example
+ with `-j 1` DVC downloads one file at a time, with `-j 2` it downloads two at
+ a time, and so forth. For SSH remotes, the default is 4.
- `-h`, `--help` - prints the usage/help message, and exit.
@@ -110,16 +108,16 @@ done and set a context for the example, let's define an SSH remote with the
`dvc remote add` command:
```dvc
-$ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/cache/directory
+$ dvc remote add r1 ssh://_username_@_host_/path/to/dvc/remote/storage
$ dvc remote list
-r1 ssh://_username_@_host_/path/to/dvc/cache/directory
+r1 ssh://_username_@_host_/path/to/dvc/remote/storage
```
> DVC supports several remote types. For details, see the
> [`remote add`](/doc/commands-reference/remote/add) documentation.
-With a remote cache containing some images and other files, we can pull all
-changed files from the current Git branch:
+Having some images and other files in remote storage, we can pull all changed
+files from the current Git branch:
```dvc
$ dvc pull --remote r1
@@ -160,8 +158,8 @@ model.p.dvc
Dvcfile
```
-Imagine the remote storage has been modified such that the data files in some of
-these stages should be updated into the local cache.
+Imagine the remote storage has been modified such that the data in some of these
+stages should be updated in the workspace.
```dvc
$ dvc status --cloud
diff --git a/static/docs/commands-reference/push.md b/static/docs/commands-reference/push.md
index 7ee6436c53..97668e0981 100644
--- a/static/docs/commands-reference/push.md
+++ b/static/docs/commands-reference/push.md
@@ -31,16 +31,16 @@ save any changes in the code or DVC-files. Those should be saved by using
Under the hood a few actions are taken:
- The push command by default uses all
- [DVC-files](/doc/user-guide/dvc-file-format in the current version. The
+ [DVC-files](/doc/user-guide/dvc-file-format) in the workspace. The
command-line options listed below will either limit or expand the set of
DVC-files to consult.
-- For each output referenced from each selected DVC-files, it finds a
- corresponding entry in the local cache. DVC checks if the entry
- exists, or not, in the remote simply by looking for it using the checksum.
- From this DVC gathers a list of files missing from the remote storage.
+- For each output referenced from each selected DVC-file, DVC finds a
+ corresponding entry in the cache. DVC checks whether the entry
+ exists in the remote. From this DVC gathers a list of files missing from the
+ remote storage.
-- Upload the cache files missing from the remote cache, if any, to the remote.
+- Upload the cache files missing from remote storage, if any, to the remote.
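The steps above boil down to a set comparison by checksum. The following is a minimal sketch with hypothetical, shortened checksums; sets stand in for the cache and the remote, and nothing is actually uploaded.

```python
def missing_from_remote(referenced, cache, remote):
    """List cached checksums that the selected DVC-files reference
    but that the remote does not hold yet."""
    return sorted(c for c in referenced if c in cache and c not in remote)


referenced = {"3863d0", "42c702", "a9c512"}  # outputs named by DVC-files
cache = {"3863d0", "42c702", "a9c512"}       # entries present in the cache
remote = {"42c702"}                          # entries already in the remote

to_upload = missing_from_remote(referenced, cache, remote)
print(to_upload)  # ['3863d0', 'a9c512']
```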
The DVC `push` command always works with a remote storage, and it is an error if
none are specified on the command line nor in the configuration. If a
@@ -50,19 +50,15 @@ and this [example](/doc/get-started/configure) for more information on how to
configure a remote.
With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads
-only the files (or directories) that are new in the local repository to the
-remote cache. It will not upload files associated with earlier versions or
-branches of the project directory, nor will it upload files which
-have not changed.
+only the files (or directories) that are new in the local repository to remote
+storage. It will not upload files associated with earlier versions or branches
+of the project directory, nor will it upload files which have not
+changed.
-The command `dvc status -c` can list files that are new in the local cache and
-are referenced in the workspace. It can be used to see what files
+The `dvc status -c` command can list files tracked by DVC that are new in the
+cache (compared to the default remote). It can be used to see what files
`dvc push` would upload.
-The `dvc status -c` command can show files which exist in the remote cache and
-not exist in the local cache. Running `dvc push` from the local cache does not
-remove nor modify those files in the remote cache.
-
If one or more `targets` are specified, DVC only considers the files associated
with those DVC-files. Using the `--with-deps` option, DVC tracks dependencies
backward from the target [stage files](/doc/commands-reference/run), through the
@@ -71,12 +67,10 @@ to push.
## Options
-- `-r REMOTE`, `--remote REMOTE` specifies which remote cache (see
- `dvc remote list`) to push to. The value for `REMOTE` is a cache name defined
- using the `dvc remote` command. If no `REMOTE` is given, or if no remote's are
- defined in the project, an error message is printed. If the option is not
- specified, then the default remote, configured with the `core.config` config
- option, is used.
+- `-r REMOTE`, `--remote REMOTE` specifies which remote to push to (see
+ `dvc remote list`). The value for `REMOTE` is a name defined using
+ `dvc remote`. If the option is not specified, then the default remote
+ (configured with the `core.remote` config option) is used.
- `-a`, `--all-branches` - determines the files to upload by examining DVC-files
in all branches of the project repository (if using Git). It's useful if
@@ -96,10 +90,10 @@ to push.
each target directory and its subdirectories for DVC-files to inspect.
- `-j JOBS`, `--jobs JOBS` - specifies number of jobs to run simultaneously
- while uploading files to the remote cache. The effect is to control the number
- of files uploaded simultaneously. Default is `4 * cpu_count()`. For example
- with `-j 1` DVC uploads one file at a time, with `-j 2` it uploads two at a
- time, and so forth. For SSH remotes default is set to 4.
+ while uploading files to the remote. The effect is to control the number of
+ files uploaded simultaneously. Default is `4 * cpu_count()`. For example with
+ `-j 1` DVC uploads one file at a time, with `-j 2` it uploads two at a time,
+ and so forth. For SSH remotes default is set to 4.
- `-h`, `--help` - prints the usage/help message, and exit.
@@ -163,8 +157,8 @@ model.p.dvc
Dvcfile
```
-Imagine the local cache has been modified such that the data files in some of
-these stages should be uploaded to the remote cache.
+Imagine the project has been modified such that the output of some of these
+stages should be uploaded to remote storage.
```dvc
$ dvc status --cloud
@@ -211,15 +205,16 @@ double check that all data had been uploaded.
## Example: What happens in the cache
-Let's take a detailed look at what happens to the DVC cache as you run an
-experiment locally and push data to a remote cache. To set the example consider
-having created a workspace that contains some code and data, and
-having set up a remote cache.
+Let's take a detailed look at what happens to the
+[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+as you run an experiment locally and push data to remote storage. To set up the
+example, consider a workspace that contains some code and data, with a remote
+already configured.
-Some work has been performed in the local workspace, and it contains new data to
-upload to the shared remote cache. When running `dvc status --cloud` the report
-will list several files in `new` state. By looking in the cache directories we
-can see exactly what that means.
+Some work has been performed in the workspace, and it contains new data to
+upload to the shared remote. When running `dvc status --cloud` the report will
+list several files in `new` state. We can see exactly what that means by looking
+in the project's cache:
```dvc
$ tree .dvc/cache
@@ -262,16 +257,15 @@ $ tree ../vault/recursive
```
The directory `.dvc/cache` is the local cache, while `../vault/recursive` is the
-remote cache. This listing clearly shows the local cache has more files in it
-than the remote cache. Therefore `new` literally means that new files exist in
-the local cache compared to the remote.
+remote storage (a "local remote" in this case). This listing shows the cache
+having more files in it than the remote does (which is what `new` means).
-Next we can upload part of the data from the local cache to a remote using the
-command `dvc push --with-deps STAGE.dvc`. Remember that `--with-deps` searches
+Next we can upload part of the data from the cache to the remote using the
+command `dvc push --with-deps .dvc`. Remember that `--with-deps` searches
backwards from the DVC-file `targets` to locate files to upload, and does not
upload files in subsequent stages.
-After doing that we can inspect the remote cache again:
+After doing that we can inspect the remote storage again:
```dvc
$ tree ../vault/recursive
@@ -296,13 +290,13 @@ $ tree ../vault/recursive
8 directories, 8 files
```
-The remote cache now has some of the files which had been missing, but not all
+The remote storage now has some of the files which had been missing, but not all
of them. Indeed `dvc status --cloud` still lists a couple files as `new`. We can
-clearly see this in that a couple files are in the local cache and not in the
-remote cache.
+clearly see this above, since a couple files are in the cache, but not in the
+remote.
-After running `dvc push` to cause all files to be uploaded the remote cache now
-has all the files:
+After running `dvc push` to cause all files to be uploaded, the remote storage
+now contains all of them:
```dvc
$ tree ../vault/recursive
@@ -335,5 +329,5 @@ $ dvc status --cloud
Data and pipelines are up to date.
```
-And running `dvc status --cloud` verifies that indeed there are no more files to
-upload to the remote cache.
+Running `dvc status --cloud` then verifies that there are indeed no more files
+to push to remote storage.
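The three "under the hood" steps described for `dvc push` can be illustrated with a minimal Python sketch. This is not DVC's implementation: the function names `entry_path` and `push` are hypothetical, the remote is assumed to be a plain local directory (a "local remote"), and both the cache and the remote are assumed to store each entry under a directory named after the first two characters of its checksum.

```python
import shutil
import tempfile
from pathlib import Path


def entry_path(root: Path, checksum: str) -> Path:
    """Map a checksum to its assumed location: <root>/<first 2 chars>/<rest>."""
    return root / checksum[:2] / checksum[2:]


def push(checksums, cache_dir: Path, remote_dir: Path):
    """Upload cache entries that the remote lacks; return their checksums.

    `checksums` stands in for the outputs referenced by the selected DVC-files.
    """
    # Steps 1-2: keep only entries present in the cache and missing remotely.
    missing = [
        c for c in checksums
        if entry_path(cache_dir, c).is_file()
        and not entry_path(remote_dir, c).exists()
    ]
    # Step 3: copy the missing entries to the remote.
    for c in missing:
        target = entry_path(remote_dir, c)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(entry_path(cache_dir, c), target)
    return missing
```

Calling `push` a second time with the same arguments returns an empty list, which mirrors `dvc status --cloud` reporting that everything is up to date after a full push.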
diff --git a/static/docs/commands-reference/remote/add.md b/static/docs/commands-reference/remote/add.md
index fb0f467e27..2e7a82c334 100644
--- a/static/docs/commands-reference/remote/add.md
+++ b/static/docs/commands-reference/remote/add.md
@@ -24,20 +24,21 @@ positional arguments:
## Description
`name` and `url` are required. `url` specifies a location to store your data. It
-could be S3 path, SSH path, Azure, Google cloud, Aliyun OSS local directory,
-etc. (See more examples below.) If `url` is a local relative path, it will be
-resolved relative to the current working directory but saved **relative to the
-config file location** (see LOCAL example below). Whenever possible DVC will
-create a remote directory if it doesn't exists yet. It won't create an S3 bucket
-though and will rely on default access settings.
-
-> If you installed DVC via `pip`, depending on the remote type you plan to use
-> you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`,
-> `[azure]`, and `[oss]`; or `[all]` to include them all. The command should
-> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along
-> with DVC to support AWS S3 storage.
-
-This command creates a section in the DVC
+can be an SSH or S3 path, an Azure or Google Cloud address, Aliyun OSS, a local
+directory, etc. (See all the supported remote storage types in the examples
+below.) If `url` is a local relative path, it will be resolved relative to the
+current working directory but saved **relative to the config file location**
+(see LOCAL example below). Whenever possible DVC will create a remote directory
+if it doesn't exist yet. It won't create an S3 bucket though and will rely on
+default access settings.
+
+> If you installed DVC via `pip`, depending on the remote storage type you plan
+> to use you might need to install optional dependencies: `[s3]`, `[ssh]`,
+> `[gs]`, `[azure]`, and `[oss]`; or `[all]` to include them all. The command
+> should look like this: `pip install "dvc[s3]"`. This installs the `boto3`
+> library along with DVC to support AWS S3 storage.
+
+This command creates a section in the DVC project's
[config file](/doc/commands-reference/config) and optionally assigns a default
remote in the core section if the `--default` option is used:
@@ -74,11 +75,11 @@ Use `dvc config` to unset/change the default remote as so:
using this remote by default to save or retrieve data files unless `-r` option
is specified for them.
-- `-f`, `--force` - to overwrite existing remote with new `url` value.
+- `-f`, `--force` - overwrite existing remote with new `url` value.
## Examples
-The following are the types and of remotes (protocols) supported:
+The following are the types of remote storage (protocols) supported:
@@ -195,7 +196,7 @@ $ dvc remote modify myremote connection_string my-connection-string --local
```
> The connection string contains access to data and is inserted into the
-> `.dvc/config file.` Therefore, it is safer to add the connection string with
+> `.dvc/config` file. Therefore, it is safer to add the connection string with
> the `--local` option, enforcing it to be written to a Git-ignored config file.
The Azure Blob Storage remote can also be configured entirely via environment
@@ -340,7 +341,7 @@ Setting 'myremote' as a default remote.
$ dvc remote modify myremote region us-east-2
```
-DVC config file (`.dvc/config`) now looks like this:
+The project's config file (`.dvc/config`) now looks like this:
```ini
['remote "myremote"']
diff --git a/static/docs/commands-reference/remote/default.md b/static/docs/commands-reference/remote/default.md
index b9aa8ef804..5bd3069e27 100644
--- a/static/docs/commands-reference/remote/default.md
+++ b/static/docs/commands-reference/remote/default.md
@@ -2,8 +2,8 @@
Set/unset a default data remote.
-> Depending on your storage type, you may also need `dvc remote modify` to
-> provide credentials and/or configure other remote parameters.
+> Depending on your remote storage type, you may also need `dvc remote modify`
+> to provide credentials and/or configure other remote parameters.
See also [add](/doc/commands-reference/remote/add),
[list](/doc/commands-reference/remote/list),
diff --git a/static/docs/commands-reference/remote/index.md b/static/docs/commands-reference/remote/index.md
index 9c485649b0..368168946b 100644
--- a/static/docs/commands-reference/remote/index.md
+++ b/static/docs/commands-reference/remote/index.md
@@ -25,19 +25,19 @@ positional arguments:
What is data remote?
-The same way as Github provides storage hosting for Git repositories, DVC data
-remotes provide a central place to keep and share data and model files. With a
-remote data storage, you can pull models and data files which were created by
+The same way as GitHub provides storage hosting for Git repositories, DVC
+remotes provide a central place to keep and share data and model files. With
+this remote storage, you can pull models and data files which were created by
your team members without spending time and resources to build or process them
locally. It also saves space on your local environment: DVC can
-[fetch](/doc/commands-reference/fetch) into the local cache only the data you
-need for a specific branch/commit.
+[fetch](/doc/commands-reference/fetch) into the cache directory
+only the data you need for a specific branch/commit.
-> If you installed DVC via `pip`, depending on the remote type you plan to use
-> you might need to install optional dependencies: `[s3]`, `[ssh]`, `[gs]`,
-> `[azure]`, and `[oss]`; or `[all]` to include them all. The command should
-> look like this: `pip install "dvc[s3]"`. This installs `boto3` library along
-> with DVC to support AWS S3 storage.
+> If you installed DVC via `pip`, depending on the remote storage type you plan
+> to use you might need to install optional dependencies: `[s3]`, `[ssh]`,
+> `[gs]`, `[azure]`, and `[oss]`; or `[all]` to include them all. The command
+> should look like this: `pip install "dvc[s3]"`. This installs the `boto3`
+> library along with DVC to support AWS S3 storage.
Using DVC with a remote data storage is optional. By default, DVC is configured
to use a local data storage only (usually `.dvc/cache` directory inside your
@@ -85,7 +85,7 @@ $ dvc remote list
myremote /path/to/remote
```
-DVC config file would look like:
+The project's config file would look like:
```ini
['remote "myremote"']
diff --git a/static/docs/commands-reference/remote/list.md b/static/docs/commands-reference/remote/list.md
index 9ce0b2d19e..02b8058ae5 100644
--- a/static/docs/commands-reference/remote/list.md
+++ b/static/docs/commands-reference/remote/list.md
@@ -1,6 +1,6 @@
# remote list
-Show all available remotes.
+Show all available data remotes.
See also [add](/doc/commands-reference/remote/add),
[default](/doc/commands-reference/remote/default),
@@ -15,8 +15,8 @@ usage: dvc remote list [-h] [--global] [--system] [--local] [-q | -v]
## Description
-Reads DVC configuration files and prints the list of available remotes.
-Including names and URLs.
+Reads DVC configuration files and prints the list of available remotes,
+including names and URLs.
## Options
diff --git a/static/docs/commands-reference/remote/modify.md b/static/docs/commands-reference/remote/modify.md
index 2df6ea35b9..76eb907ff9 100644
--- a/static/docs/commands-reference/remote/modify.md
+++ b/static/docs/commands-reference/remote/modify.md
@@ -1,10 +1,10 @@
# remote modify
-Modify configuration of remotes.
+Modify configuration of data remotes.
> This command is commonly needed after `dvc remote add` or
> [default](/doc/commands-reference/remote/default) to setup credentials or
-> other customizations to each remote type.
+> other customizations to each remote storage type.
See also [add](/doc/commands-reference/remote/add),
[default](/doc/commands-reference/remote/default),
@@ -27,10 +27,10 @@ positional arguments:
## Description
Remote `name` and `option` name are required. Option names are remote type
-specific. See below examples and a list of per remote type: AWS S3, Google
+specific. See the examples below for each remote storage type: AWS S3, Google
Cloud, Azure, SSH, Aliyun OSS, and others.
-This command modifies a `remote` section in the DVC project's
+This command modifies a `remote` section in the project's
[config file](/doc/commands-reference/config). Alternatively, `dvc config` or
manual editing could be used to change the configuration.
@@ -60,7 +60,7 @@ manual editing could be used to change the configuration.
## Examples
-The following are the types and of remotes (protocols) supported:
+The following are the types of remote storage (protocols) supported:
@@ -122,7 +122,7 @@ these settings, you could use the following options:
```
- `acl` - set object level access control list (ACL) such as `private`,
-`public-read`, etc. By default, no ACL is specified.
+ `public-read`, etc. By default, no ACL is specified.
```dvc
$ dvc remote modify myremote acl bucket-owner-full-control
@@ -263,13 +263,14 @@ For more information on configuring Azure Storage connection strings, visit
```dvc
$ dvc remote modify myremote ask_password true
```
-
-- `gss_auth` - use Generic Security Services authentication if available on
- host (for example, [with kerberos](https://en.wikipedia.org/wiki/Generic_Security_Services_Application_Program_Interface#Relationship_to_Kerberos)).
+
+- `gss_auth` - use Generic Security Services authentication if available on host
+ (for example,
+ [with kerberos](https://en.wikipedia.org/wiki/Generic_Security_Services_Application_Program_Interface#Relationship_to_Kerberos)).
Using this option requires `paramiko[gssapi]` which is currently only
supported by our pip package and could be installed with
- `pip install 'dvc[ssh_gssapi]'`. Other packages (Conda, Windows, Homebrew
- cask and Mac pkg) do not support it.
+ `pip install 'dvc[ssh_gssapi]'`. Other packages (Conda, Windows, Homebrew cask
+ and Mac pkg) do not support it.
```dvc
$ dvc remote modify myremote gss_auth true
diff --git a/static/docs/commands-reference/remote/remove.md b/static/docs/commands-reference/remote/remove.md
index 78628f0a53..e21ee00669 100644
--- a/static/docs/commands-reference/remote/remove.md
+++ b/static/docs/commands-reference/remote/remove.md
@@ -1,6 +1,6 @@
# remote remove
-Remove a specified remote. This command affects DVC configuration files only, it
+Remove a data remote. This command affects DVC configuration files only, it
does not physically remove data files stored remotely.
See also [add](/doc/commands-reference/remote/add),
diff --git a/static/docs/commands-reference/remove.md b/static/docs/commands-reference/remove.md
index 3eb045c710..ca389e20f2 100644
--- a/static/docs/commands-reference/remove.md
+++ b/static/docs/commands-reference/remove.md
@@ -8,8 +8,7 @@ Properly remove data files or directories tracked by DVC.
usage: dvc remove [-h] [-q | -v] [-o | -p] [-f] targets [targets ...]
positional arguments:
- targets DVC-files to remove. Optional. (Finds all
- DVC-files in the workspace by default.)
+ targets DVC-files to remove.
```
## Description
diff --git a/static/docs/commands-reference/repro.md b/static/docs/commands-reference/repro.md
index 1f47c3cc4b..e6fa96615a 100644
--- a/static/docs/commands-reference/repro.md
+++ b/static/docs/commands-reference/repro.md
@@ -1,9 +1,9 @@
# repro
-Run again commands recorded in the [stages](/doc/commands-reference/run) of one
-or more [pipelines](/doc/commands-reference/pipeline), in the correct order. The
-commands to be run are determined by recursively analyzing target stages and
-changes in their dependencies.
+Reproduce complete or partial [pipelines](/doc/commands-reference/pipeline) by
+executing commands defined in their [stages](/doc/commands-reference/run), in
+the correct order. The commands to be executed are determined by recursively
+analyzing dependencies and outputs of the target stages.
## Synopsis
@@ -18,13 +18,17 @@ positional arguments:
## Description
-`dvc repro` provides an interface to run the commands in a computational graph
-(a.k.a. pipeline) again, as defined in the
-[stage files](/doc/commands-reference/run) (DVC-files) found in the
-project. (A pipeline is typically defined using the `dvc run`
-command, while data input nodes are defined by the `dvc add` command.)
+`dvc repro` provides a way to regenerate data pipeline results, by restoring
+the dependency graph (a
+[DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) implicitly defined
+by [stage files](/doc/commands-reference/run) (DVC-files with dependencies) that
+are found in the project. The commands defined in these stages can
+then be executed in the correct order, reproducing pipeline results.
-There's a few ways to restrict the stages that will be run again by this
+> Pipeline stages are typically defined using the `dvc run` command, while
+> initial data dependencies can be registered by the `dvc add` command.
+
+There are a few ways to restrict the stages that will be regenerated by this
command: by specifying stage file `targets`, or by using the `--single-item`,
`--cwd`, or other options.
@@ -33,7 +37,7 @@ omitted, `Dvcfile` will be assumed.
By default, this command recursively searches in pipeline stages, starting from
the `targets`, to determine which ones have changed. Then it executes the
-corresponding commands again.
+corresponding commands.
`dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data
files, intermediate or final results. It saves all the data files, intermediate
@@ -43,11 +47,11 @@ specified), and updates stage files with the new checksum information.
## Options
- `-f`, `--force` - reproduce a pipeline, regenerating its results, even if no
- changes were found. By default this runs all of its stages but it can be
+ changes were found. By default this executes all of its stages but it can be
limited with the `targets` argument and `-s`, `-p`, or `-c` options.
- `-s`, `--single-item` - reproduce only a single stage by turning off the
- recursive search for changed dependencies. Multiple stages are run
+ recursive search for changed dependencies. Multiple stages are executed
(non-recursively) if multiple stage files are given as `targets`.
- `-c`, `--cwd` - directory within the project to reproduce from. If no
@@ -63,9 +67,9 @@ specified), and updates stage files with the new checksum information.
searching each target directory and its subdirectories for DVC-files to
inspect.
-- `--no-commit` - do not save outputs to cache. Useful when running different
- experiments and you don't want to fill up the cache with temporary files. Use
- `dvc commit` when ready to save results to cache.
+- `--no-commit` - do not save outputs to cache. (See `dvc run`.) Useful when
+ running different experiments and you don't want to fill up the cache with
+ temporary files. Use `dvc commit` when ready to commit the results to cache.
- `-m`, `--metrics` - show metrics after reproduction. The target pipelines must
have at least one metrics file defined either with the `dvc metrics` command,
@@ -75,7 +79,7 @@ specified), and updates stage files with the new checksum information.
executing the commands.
- `-i`, `--interactive` - ask for confirmation before reproducing each stage.
- The stage is only run if the user types "y".
+ The stage is only executed if the user types "y".
- `-p`, `--pipeline` - reproduce the entire pipelines that the stage file
`targets` belong to. Use `dvc pipeline show .dvc` to show the parent
@@ -92,30 +96,30 @@ specified), and updates stage files with the new checksum information.
`requirements.txt`, we can specify it only once in `A`, omitting it in `B` and
`C`. To be precise, it reproduces all descendants of a changed stage or the
stages following the changed stage, even if their direct dependencies did not
- change. Like with the same option on `dvc run`, this is a way to force stages
- without changes to run again. This can also be useful for pipelines containing
- stages that produce nondeterministic (semi-random) outputs. For
- nondeterministic stages the outputs can vary on each execution, meaning the
+ change. Like with the same option on `dvc run`, this is a way to force execute
+ stages without changes. This can also be useful for pipelines containing
+ stages that produce non-deterministic (semi-random) outputs. For
+ non-deterministic stages the outputs can vary on each execution, meaning the
cache cannot be trusted for such stages.
-- `--downstream` - only run again the stages after the given `targets` in their
+- `--downstream` - only execute the stages after the given `targets` in their
corresponding pipelines, including the target stages themselves.
- `-h`, `--help` - prints the usage/help message, and exit.
- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if all
- stages are up to date or if all stages are successfully run, otherwise exit
- with 1. The command run by the stage is free to make output irregardless of
- this flag.
+ stages are up to date or if all stages are successfully executed, otherwise
+ exit with 1. The command defined in the stage is free to write output
+ regardless of this flag.
- `-v`, `--verbose` - displays detailed tracing information.
## Examples
-For simplicity, let's build a pipeline defined below (if you want get your hands
-on something more real, see this
-[mini-tutorial](/doc/get-started/example-pipeline)). It takes this `text.txt`
-file:
+For simplicity, let's build a pipeline defined below. (If you want to get your
+hands on something more real, see this short
+[pipeline tutorial](/doc/get-started/example-pipeline).) It takes this
+`text.txt` file:
```
dvc
@@ -162,7 +166,7 @@ $ tree
├── count.txt <---- result: "2"
├── filter.dvc <---- first stage
├── numbers.txt <---- intermediate result of the first stage
-├── process.py <---- code that runs some transformation
+├── process.py <---- code that causes data transformation
└── text.txt <---- text file to process
```
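The recursive search that `dvc repro` performs (start from the targets, find which stages have changed, then execute them in the correct order) can be sketched roughly as follows. This is an illustration under stated assumptions, not DVC's code: `stale_stages` is a hypothetical helper, stages are modeled as plain dictionaries, `count.dvc` is a made-up name for the second stage, and the set of changed files is passed in directly instead of being derived from checksums. The special case of stages without dependencies (always considered changed) is not modeled.

```python
def stale_stages(stages, changed_files):
    """Return the names of stages to re-execute, upstream stages first."""
    # Which stage produces each output file.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    stale, visited, order = set(), set(), []

    def visit(name):
        if name in visited:
            return name in stale
        visited.add(name)
        is_stale = False
        for dep in stages[name]["deps"]:
            upstream = producer.get(dep)
            # A stage is stale if an upstream stage is stale...
            if upstream is not None and visit(upstream):
                is_stale = True
            # ...or if one of its own dependencies changed.
            if dep in changed_files:
                is_stale = True
        if is_stale:
            stale.add(name)
            order.append(name)  # appended after its upstream stages
        return is_stale

    for name in stages:
        visit(name)
    return order


# Hypothetical model of the example pipeline above.
example_stages = {
    "filter.dvc": {"deps": ["text.txt"], "outs": ["numbers.txt"]},
    "count.dvc": {"deps": ["numbers.txt", "process.py"], "outs": ["count.txt"]},
}
```

With this model, changing `text.txt` marks both stages for execution, while changing only `process.py` marks just the second one.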
diff --git a/static/docs/commands-reference/root.md b/static/docs/commands-reference/root.md
index c8af612021..22ec39b3b5 100644
--- a/static/docs/commands-reference/root.md
+++ b/static/docs/commands-reference/root.md
@@ -1,6 +1,6 @@
# root
-Returns relative path to project's directory.
+Returns relative path to the DVC project.
## Synopsis
@@ -10,10 +10,10 @@ usage: dvc root [-h] [-q | -v]
## Description
-While in project's sub-directory, sometimes developers may want to refer some
-file belonging to another directory. This command returns relative path to the
-DVC project's root directory from the current working directory. So, this
-command can be used to build a path to a dependency file, command, or output.
+While in sub-directories of the project, sometimes developers may want to refer
+to some file belonging to another directory. This command returns the relative
+path to the project root from the current working directory, so it can be used
+to build a path to a dependency file, command, or output.
## Options
diff --git a/static/docs/commands-reference/run.md b/static/docs/commands-reference/run.md
index abbd014b67..4ef369323f 100644
--- a/static/docs/commands-reference/run.md
+++ b/static/docs/commands-reference/run.md
@@ -18,38 +18,36 @@ positional arguments:
## Description
-`dvc run` provides an interface to build a computational graph (a.k.a.
-pipeline). It's a way to describe commands, data inputs and intermediate results
-that go into creating a ML model (or other data results). By explicitly
-specifying a list of dependencies (with `-d` option) and outputs (with `-o`,
-`-O`, `-m`, or `-M` options) DVC can connect each individual stage (command)
-into a directed acyclic graph (DAG). All the remainder of command-line input
-provided to `dvc run` after the optional arguments (`-` or `--` dashed options)
-will become the required `command` argument.
-
-> Remember to wrap the `command` with `"` quotes if there are special characters
-> in it like `|` (pipe) or `<`, `>` (redirection) that would otherwise apply to
-> the entire `dvc run` command. E.g.
-> `dvc run -d script.sh "./script.sh > /dev/null 2>&1"` Use single quotes `'`
-> instead of `"` to wrap the `command` if there are environment variables in it,
-> that you want to be evaluated dynamically. E.g.
-> `dvc run -d script.sh './myscript.sh $MYENVVAR'`
+`dvc run` provides an interface to describe stages: individual commands and the
+data inputs and outputs that go into creating a data result. By specifying a
+list of dependencies (`-d` option) and outputs (`-o`, `-O`, `-m`, or `-M`
+options) DVC can later connect each stage by building a dependency graph
+([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). This graph is
+used by DVC to restore a full data [pipeline](/doc/commands-reference/pipeline).
+
+The remainder of command-line input provided to `dvc run` after the options (`-`
+or `--` arguments) will become the required `command` argument. Please wrap the
+`command` with `"` quotes if there are special characters in it like `|` (pipe)
+or `<`, `>` (redirection) that would otherwise apply to the entire `dvc run`
+command, e.g. `dvc run -d script.sh "./script.sh > /dev/null 2>&1"`. Use single
+quotes `'` instead of `"` to wrap the `command` if it contains environment
+variables that you want to be evaluated dynamically, e.g.
+`dvc run -d script.sh './myscript.sh $MYENVVAR'`.
Unless the `-f` option is used, by default the DVC-file name generated is
`<file>.dvc`, where `<file>` is the file name of the first output (`-o`, `-O`, `-m`,
or `-M` option). If neither `-f`, nor outputs are specified, the stage name
defaults to `Dvcfile`.
-Since `dvc run` provides a way to build a graph of computations, using
-dependencies and outputs to connect different stages it checks computational
-graph integrity properties before creating a new stage. For example, for every
-output there should be only one stage that explicitly specifies it. There should
-be no cycles, etc.
+Since `dvc run` provides a way to build a dependency graph using dependencies
+and outputs to connect different stages, it checks the graph's integrity before
+creating a new stage. For example, for every output there should be only one
+stage that explicitly specifies it. There should be no cycles, etc.
Note that `dvc repro` provides an interface to check state and reproduce this
-graph later. This concept is similar to the one of the `Makefile` but DVC
-captures data and caches data artifacts along the way. See this
-[example](/doc/get-started/example-pipeline) to learn more and try to build a
+graph (pipeline) later. This concept is similar to the one of the `Makefile` but
+DVC captures data and caches data artifacts along the way. See this
+[example](/doc/get-started/example-pipeline) to learn more and try to create a
pipeline.
## Options
@@ -60,20 +58,18 @@ pipeline.
configuration file. DVC also supports certain
[external dependencies](/doc/user-guide/external-dependencies).
- DVC builds a computation graph and this list of dependencies is a way to
- connect different stages with each other. When you run `dvc repro` to
- reproduce a stage (or when a stage is reproduced due to recursive dependency),
- the list of dependencies helps DVC analyze whether any dependencies have
- changed and thus running the stage again is required. A special case is when
- no dependencies are specified.
+ DVC builds a dependency graph connecting different stages with each other.
+ When you run `dvc repro`, the list of dependencies helps DVC analyze whether
+ any dependencies have changed, and thus which stages must be executed to
+ regenerate their output. A special case is when no dependencies are specified.
> Note that a DVC-file without dependencies is considered always _changed_, so
> `dvc repro` always executes it.
- `-o`, `--outs` - specify a file or a directory that are results of running the
command. Multiple outputs can be specified like this:
- `-o model.pkl -o output.log`. DVC is building a computation graph and this
- list of outputs (along with dependencies described above) is a way to connect
+ `-o model.pkl -o output.log`. DVC is building a dependency graph and this list
+ of outputs (along with dependencies described above) is a way to connect
different stages with each other. DVC takes all output files and directories
under its control and will put them into the cache (this is similar to what's
happening when you run `dvc add`).
@@ -115,11 +111,12 @@ pipeline.
is used by `dvc repro` to change the working directory before running the
command.
-- `--no-exec` - create a stage file, but do not run the command specified nor
- take dependencies or outputs under DVC control. In the DVC-file contents, the
- `md5` hash sums will be empty; They will be populated the next time this stage
- is actually executed. This command is useful, if for example, you need to
- build a pipeline (computational graph) first, and then run it all at once.
+- `--no-exec` - create a stage file, but do not execute the command defined in
+ it, nor take dependencies or outputs under DVC control. In the DVC-file
+ contents, the `md5` hash sums will be empty; they will be populated the next
+ time this stage is actually executed. This option is useful if, for example,
+ you need to build a pipeline (dependency graph) first, and then run it all at
+ once.
- `-y`, `--yes` - deprecated, use `--overwrite-dvcfile` instead.
@@ -129,21 +126,20 @@ pipeline.
- `--ignore-build-cache` - if an exactly equal DVC-file exists (same list of
outputs and inputs, the same command to run) which has been already executed,
- and is up to date, with option `dvc run` won't execute the command again by
- default (thus "build cache"). This option gives a way to forcefully run the
- command anyway. It's useful if the command is considered non-deterministic for
- some reason (meaning it produces different outputs from the same list of
- inputs).
+ and is up to date, `dvc run` won't normally execute the command again (thus
+ "build cache"). This option gives a way to forcefully execute the command
+ anyway. It's useful if the command is non-deterministic (meaning it produces
+ different outputs from the same list of inputs).
-- `--remove-outs` - it removes stage outputs before running the command. If
+- `--remove-outs` - it removes stage outputs before executing the command. If
`--no-exec` specified outputs are removed anyway. This option is enabled by
default and deprecated. See `dvc remove` as well for more details.
- `--no-commit` - do not save outputs to cache. A DVC-file is created, and an
- entry is added to `.dvc/state`, while nothing is added to the cache. Use
- `dvc commit` when you are ready to save your results to cache. Useful when
- running different experiments and you don't want to fill up your cache with
- temporary files.
+ entry is added to `.dvc/state`, while nothing is added to the cache. Useful
+ when running different experiments and you don't want to fill up your cache
+ with temporary files. Use `dvc commit` when ready to commit the results to
+ cache.
> The `dvc status` command will mention that the file is `not in cache`.
diff --git a/static/docs/commands-reference/status.md b/static/docs/commands-reference/status.md
index de4601af82..2e62d309bd 100644
--- a/static/docs/commands-reference/status.md
+++ b/static/docs/commands-reference/status.md
@@ -2,8 +2,8 @@
Show changes in the project
[pipelines](/doc/commands-reference/pipeline), as well as mismatches either
-between the local cache and local files, or between the local cache and remote
-cache.
+between the cache and workspace files, or between the
+cache and remote storage.
## Synopsis
@@ -19,17 +19,17 @@ positional arguments:
## Description
`dvc status` searches for changes in the existing pipelines, either showing
-which [stages](/doc/commands-reference/run) have changed in the
-workspace and must be reproduced (with `dvc repro`), or differences
-between local vs. remote cache (meaning `dvc push` or `dvc pull`
-should be run to synchronize them). The two modes, _local_ and _cloud_ are
-triggered by using the `--cloud` or `--remote` options:
-
-| Mode | CLI Option | Description |
-| ------ | ---------- | ----------------------------------------------------------------------------------------------------------------------------- |
-| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the local cache (`.dvc/cache`) |
-| remote | `--remote` | Comparisons are made between the local cache, and the given remote. Remote caches are defined using the `dvc remote` command. |
-| remote | `--cloud` | Comparisons are made between the local cache, and the default remote, defined with `dvc remote --default` command. |
+which [stages](/doc/commands-reference/run) have changed in the workspace and
+must be reproduced (with `dvc repro`), or differences between cache vs. remote
+storage (meaning `dvc push` or `dvc pull` should be run to synchronize them).
+The _local_ mode is the default; the _remote_ mode is triggered by using the
+`--cloud` or `--remote` options:
+
+| Mode | CLI Option | Description |
+| ------ | ---------- | --------------------------------------------------------------------------------------------------------------------------- |
+| local | _none_ | Comparisons are made between data files in the workspace and corresponding files in the cache directory (e.g. `.dvc/cache`) |
+| remote | `--remote` | Comparisons are made between the cache and the given remote. Remote storage is defined using the `dvc remote` command. |
+| remote | `--cloud` | Comparisons are made between the cache and the default remote, defined with the `dvc remote default` command. |
DVC determines data and code files to compare by analyzing all
[DVC-files](/doc/user-guide/dvc-file-format) in the project
@@ -50,7 +50,7 @@ Data and pipelines are up to date.
```
This indicates that no differences were detected, and therefore no stages would
-be run again by `dvc repro`.
+be executed by `dvc repro`.
If instead, differences are detected, `dvc status` lists those changes. For each
DVC-file (stage) with differences, the changes in _dependencies_ and/or
@@ -83,16 +83,16 @@ outputs described in it.
the DVC-file is up to date, but there is no corresponding cache
entry.
-**For comparison against a remote cache:**
+**For comparison against remote storage:**
-- _new_ means the file exists in the local cache but not the remote cache
-- _deleted_ means the file doesn't exist in the local cache, but exists in the
- remote cache
+- _new_ means that the file/directory exists in the cache but not in remote
+ storage.
+- _deleted_ means that the file/directory doesn't exist in the cache, but exists
+ in remote storage.
-For either the _new_ and _deleted_ cases, the local cache (subset of it
-determined by the current workspace) is different from the remote cache.
-Bringing the two into sync requires `dvc pull` or `dvc push` to synchronize the
-DVC cache. For the typical process to update the workspace, see
+For either _new_ or _deleted_ data, the cache (subset determined by the current
+workspace) is different from remote storage. Bringing the two into sync requires
+`dvc pull` or `dvc push`. For the typical process to update the workspace, see
[Share Data And Model Files](/doc/use-cases/share-data-and-model-files).
## Options
@@ -104,23 +104,24 @@ DVC cache. For the typical process to update the workspace, see
will not show changes occurring in later stages than the `targets`. Applies
whether or not `--cloud` is specified.
-- `-c`, `--cloud` - enables comparison against a remote cache. If no `--remote`
- option has been given, DVC will compare against the default remote cache,
- which is specified in the `core.remote` config option. Otherwise the
+- `-c`, `--cloud` - enables comparison against a remote (see `dvc remote`). If
+ no `--remote` option has been given, DVC will compare against the default
+ remote (specified in the `core.remote` config option). Otherwise the
comparison will be against the remote specified in the `--remote` option.
- `-r REMOTE`, `--remote REMOTE` - specifies which remote storage (see
`dvc remote list`) to compare against. The argument, `REMOTE`, is a remote
name defined using the `dvc remote` command. Implies `--cloud`.
-- `-a`, `--all-branches` - compares cache content against all Git branches.
- Instead of checking just the current workspace version, it runs the same
- status command in all the branches of this repo. The corresponding branches
- are shown in the status output. Applies only if `--cloud` or a `-r` remote is
- specified.
+- `-a`, `--all-branches` - compares cache content against all Git branches
+ instead of checking just the current workspace version. This basically runs
+ the same status command in all the branches of this repo. The corresponding
+ branches are shown in the status output. Applies only if `--cloud` or a `-r`
+ remote is specified.
- `-T`, `--all-tags` - compares cache content against all Git tags instead of
- checking just the current workspace version. The corresponding tags are shown
+ checking just the current workspace version. This basically runs the same
+ status command in all the tags of this repo. The corresponding tags are shown
in the status output. Applies only if `--cloud` or a `-r` remote is specified.
- `-j JOBS`, `--jobs JOBS` - specifies the number of jobs DVC can use to
@@ -130,7 +131,7 @@ DVC cache. For the typical process to update the workspace, see
- `-h`, `--help` - prints the usage/help message, and exit.
- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if
- Pipelines are up to date, otherwise 1.
+ data and pipelines are up to date, otherwise 1.
- `-v`, `--verbose` - displays detailed tracing information.
@@ -184,15 +185,14 @@ what files we have generated but haven't pushed to the remote yet:
```dvc
$ dvc remote list
-rcache s3://dvc-remote
+storage s3://dvc-remote
```
And would like to check what files we have generated but haven't pushed to the
remote yet:
```dvc
-$ dvc status --remote rcache
-
+$ dvc status --remote storage
Preparing to collect status from s3://dvc-remote
[##############################] 100% Collecting information
new: data/model.p
@@ -201,5 +201,5 @@ Preparing to collect status from s3://dvc-remote
new: data/matrix-test.p
```
-The output shows where the location of the remote cache as well as any
-differences between the local cache and remote cache.
+The output shows the location of the remote storage, as well as any differences
+between the cache and the `storage` remote.
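Reviewer note: the _new_ / _deleted_ wording in the `status.md` hunks above can be illustrated with a minimal sketch. It assumes the cache and the remote can each be summarized as a set of file names; the function name `compare_cache_to_remote` is hypothetical and is not part of DVC, which tracks content by checksum rather than by path.

```python
def compare_cache_to_remote(cache_files, remote_files):
    """Classify entries the way the remote-status description reads.

    Both arguments are iterables of file names (an assumption made for
    illustration only).
    """
    cache_files = set(cache_files)
    remote_files = set(remote_files)
    status = {}
    # Present in the cache but not in remote storage: reported as "new".
    for path in sorted(cache_files - remote_files):
        status[path] = "new"
    # Present in remote storage but not in the cache: reported as "deleted".
    for path in sorted(remote_files - cache_files):
        status[path] = "deleted"
    return status


if __name__ == "__main__":
    cache = {"data/model.p", "data/matrix-test.p", "data/shared.p"}
    remote = {"data/shared.p", "data/old.p"}
    for path, state in compare_cache_to_remote(cache, remote).items():
        print(f"{state}: {path}")
```

Entries that exist on both sides are omitted, which mirrors the idea that only differences are listed.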
diff --git a/static/docs/get-started/add-files.md b/static/docs/get-started/add-files.md
index 7fba170847..ba3e2ec22b 100644
--- a/static/docs/get-started/add-files.md
+++ b/static/docs/get-started/add-files.md
@@ -16,8 +16,8 @@ $ wget https://data.dvc.org/get-started/data.xml -O data/data.xml
If you experienced problems using `wget` or you're on Windows and you don't want
to install it, you'll need to use a browser to download `data.xml` and save it
into `data` subdirectory. To download, right-click
-[this link](https://data.dvc.org/get-started/data.xml) and click `Save link as` (Chrome) or
-`Save object as` (Firefox).
+[this link](https://data.dvc.org/get-started/data.xml) and click `Save link As`
+(Chrome) or `Save Object As` (Firefox).
diff --git a/static/docs/get-started/connect-code-and-data.md b/static/docs/get-started/connect-code-and-data.md
index d8a0ac2aaa..c070d5ee98 100644
--- a/static/docs/get-started/connect-code-and-data.md
+++ b/static/docs/get-started/connect-code-and-data.md
@@ -118,9 +118,8 @@ wdir: .
```
> `dvc run` is just the first of a set of DVC command required to generate a
-> [pipeline](/doc/get-started/pipeline) computational graph, or in other words,
-> instructions on how to build a ML model (data file) from previous data files
-> (or directories).
+> [pipeline](/doc/get-started/pipeline), or in other words, instructions on how
+> to build a ML model (data file) from previous data files (or directories).
We would recommend to read a few next chapters first, before switching to other
documents. Hopefully, `dvc run` and `dvc repro` will make more sense after
@@ -136,7 +135,8 @@ readable.
`-d src/prepare.py` and `-d data/data.xml` mean that the `prepare.dvc` stage
file depends on them to produce the result. When you run `dvc repro` next time
(see next chapter) DVC will automatically check these dependencies and decide
-whether this stage is up to date or or whether it requires rebuilding.
+whether this stage is up to date or whether it should be executed to
+regenerate its outputs.
`-o data/prepared` specifies the output directory processed data will be put
into. The script creates two files in it - that will be used later to generate
diff --git a/static/docs/get-started/example-pipeline.md b/static/docs/get-started/example-pipeline.md
index 978ddb242f..38c854f824 100644
--- a/static/docs/get-started/example-pipeline.md
+++ b/static/docs/get-started/example-pipeline.md
@@ -168,9 +168,9 @@ is automatically added to the `.gitignore` file and a link is created into a
cache `.dvc/cache/a3/04afb96060aad90176268345e10355` to save it.
Two things are worth noticing here. First, by analyzing dependencies and outputs
-that DVC-files describe, we can restore the full chain (DAG) of commands we need
-to apply. This is important when you run `dvc repro` to reproduce the final or
-intermediate result.
+that DVC-files describe, we can restore the full series of commands (pipeline
+stages) we need to apply. This is important when you run `dvc repro` to
+reproduce the final or intermediate result.
Second, you should see by now that the actual data is stored in the `.dvc/cache`
directory, each file having a name in a form of an md5 hash. This cache is
@@ -237,9 +237,9 @@ $ dvc run -d code/evaluate.py -d data/model.pkl -d data/matrix-test.pkl \
### Expand to learn more about DVC internals
-By analyzing dependencies and outputs in DVC-files, we can restore the full
-chain of commands (DAG) we need to apply. This is important when you run
-`dvc repro` to reproduce the final or intermediate result.
+By analyzing dependencies and outputs in DVC-files, we can generate a dependency
+graph: a series of commands DVC needs to execute. `dvc repro` does this in order
+to restore a pipeline and reproduce its intermediate or final results.
`dvc pipeline show` helps to visualize pipelines (run it with `-c` option to see
actual commands instead of DVC-files):
@@ -357,9 +357,9 @@ By wrapping your commands with `dvc run` it's easy to integrate DVC into your
existing ML development pipeline/processes without any significant effort to
rewrite your code.
-The key step to notice is that DVC automatically derives the dependencies
-between the experiment stages and builds the dependency graph (DAG)
-transparently.
+The key detail to notice is that DVC automatically derives the dependencies
+between the defined stages by building dependency graphs that represent data
+pipelines.
Not only can DVC streamline your work into a single, reproducible environment,
it also makes it easy to share this environment by Git including the
diff --git a/static/docs/get-started/example-versioning.md b/static/docs/get-started/example-versioning.md
index 1c10d4ab5a..9f38ba2c60 100644
--- a/static/docs/get-started/example-versioning.md
+++ b/static/docs/get-started/example-versioning.md
@@ -16,9 +16,9 @@ this example is to give you some hands-on experience with a very basic scenario

We first train a classifier model using 1000 labeled images, then we double the
-number and run the training again. We capture both datasets and both results and
-show how to use `dvc checkout` along with `git checkout` to switch between
-different versions.
+number and retrain our model. We capture both datasets and both results and show
+how to use `dvc checkout` along with `git checkout` to switch between different
+versions.
The specific algorithm that is used to train and validate the classifier is not
important. No prior knowledge is required about Keras. We reuse the
@@ -43,8 +43,8 @@ $ git clone https://github.com/iterative/example-versioning.git
$ cd example-versioning
```
-This command pulls a repository with a single script `train.py` that runs the
-training.
+This command pulls a repository with a single script `train.py` that will train
+the model.
Now let's install the requirements. But before we do that, we **strongly**
recommend creating a virtual environment with a tool such as
@@ -207,7 +207,7 @@ data
└── cat.1400.jpg
```
-Of course, we want to leverage these new labels and train the model again.
+Of course, we want to leverage these new labels and retrain the model.
```dvc
$ dvc add data
@@ -326,18 +326,18 @@ commands. Here we would like to outline some next topics and ideas you would be
interested to try to learn more about DVC and how it makes managing ML projects
simpler.
-First of all, you should have probably noticed that the script that trains a
-model is written in a monolithic way. It runs the `save_bottleneck_feature`
-function to pre-calculate bottom, "frozen" part of the net every time it is run.
-Features are written into files, and intention probably was that the
+First of all, you may have noticed that the script that trains the model is
+written in a monolithic way. It uses the `save_bottleneck_feature` function to
+pre-calculate the bottom, "frozen" part of the net every time it is run.
+Features are written into files, and the intention was probably that
`save_bottleneck_feature` can be commented out after the first run. It's not
very convenient to remember to comment/uncomment it every time dataset is
changed.
-Here where DVC pipelines feature comes very handy and was designed for. We
-touched it briefly when we described `dvc run` and `dvc repro` at the very end.
-The next step here would be splitting the script into two parts, and utilizing
-DVC [pipelines](/doc/commands-reference/pipeline). See
+This is what the [pipelines](/doc/commands-reference/pipeline) feature of DVC
+was designed for, and where it comes in handy. We touched on it briefly when we
+described `dvc run` and `dvc repro` at the very end. The next step here would be
+splitting the script into two parts, and utilizing pipelines. See
[this example](/doc/get-started/example-pipeline) to get a hands-on experience
with pipelines and try to apply it here. Don't hesitate to join our
[community](/chat) to ask any questions!
diff --git a/static/docs/get-started/initialize.md b/static/docs/get-started/initialize.md
index ab36dfee7f..55bd42c97a 100644
--- a/static/docs/get-started/initialize.md
+++ b/static/docs/get-started/initialize.md
@@ -23,7 +23,7 @@ $ git commit -m "Initialize DVC project"
```
After DVC initialization, a new directory `.dvc/` will be created with `config`
-and `.gitignore` files, and cache directory. These files and
+and `.gitignore` files, and cache directory. These files and
directories are hidden from the user generally and are not meant to be
manipulated directly.
diff --git a/static/docs/get-started/metrics.md b/static/docs/get-started/metrics.md
index 9bc690bc3e..00828c4b1d 100644
--- a/static/docs/get-started/metrics.md
+++ b/static/docs/get-started/metrics.md
@@ -22,7 +22,7 @@ with a single number inside.
> Please, refer to the `dvc metrics` command documentation to see more available
> options and details.
-Let's again commit and save results:
+Let's save the updated results:
```dvc
$ git add evaluate.dvc auc.metric
diff --git a/static/docs/get-started/pipeline.md b/static/docs/get-started/pipeline.md
index 0155cd1059..3d45608e38 100644
--- a/static/docs/get-started/pipeline.md
+++ b/static/docs/get-started/pipeline.md
@@ -5,7 +5,7 @@ difference between DVC and other version control tools that can handle large
data files (e.g. `git lfs`). By using `dvc run` multiple times, and specifying
outputs of a command (stage) as dependencies in another one, we can describe a
sequence of commands that gets to a desired result. This is what we call a
-**data pipeline** or computational graph.
+**data pipeline** or dependency graph.
Let's create a second stage (after `prepare.dvc`, created in the previous
chapter) to perform feature extraction:
diff --git a/static/docs/get-started/reproduce.md b/static/docs/get-started/reproduce.md
index 642f3db9d3..4f18d78175 100644
--- a/static/docs/get-started/reproduce.md
+++ b/static/docs/get-started/reproduce.md
@@ -1,11 +1,11 @@
# Reproduce
-In the previous chapters, we described our first pipeline. Basically, we created
-a number of [stage files](/doc/commands-reference/run). Each of these
-[DVC-files](/doc/user-guide/dvc-file-format) describes single stage we need to
-run towards a final result (a [pipeline]](/doc/commands-reference/pipeline)).
-Each depends on some data (either raw data files or intermediate results from
-previous stages) and code files.
+In the previous chapters, we described our first
+[pipeline](/doc/commands-reference/pipeline). Basically, we generated a number
+of [stage files](/doc/commands-reference/run)
+([DVC-files](/doc/user-guide/dvc-file-format)). These stages define individual
+commands to execute towards a final result. Each depends on some data (either
+raw data files or intermediate results from previous stages) and code files.
If you just cloned the
[project](https://github.com/iterative/example-get-started), make sure you first
@@ -19,8 +19,8 @@ $ dvc repro train.dvc
```
> If you've just followed the previous chapters, the command above will have
-> nothing to reproduce since you've already run all the pipeline stages. To
-> easily try this command, clone this example
+> nothing to reproduce since you've recently executed all the pipeline stages.
+> To easily try this command, clone this example
> [Github project](https://github.com/iterative/example-get-started) and run it
> from there.
@@ -31,9 +31,9 @@ that includes the data file in its outputs, get dependencies and commands, and
so on. It means that DVC can recursively build a complete tree of commands it
needs to execute to get the model file.
-`dvc repro` is, essentially, building this execution graph, detects stages with
-modified dependencies or missing outputs and recursively executes this graph
-starting from these stages.
+`dvc repro` essentially builds a dependency graph, detects stages with modified
+dependencies or missing outputs and recursively executes commands (nodes in this
+graph or pipeline) starting from the first stage with changes.
Thus, `dvc run` and `dvc repro` provide a powerful framework for _reproducible
experiments_ and _reproducible projects_.
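Reviewer note: the `dvc repro` paragraph in this hunk (build a dependency graph, detect changed stages, execute from the first stage with changes) can be sketched roughly as follows. This is an illustration under stated assumptions, not DVC's implementation: stages are modeled as a plain dict mapping a stage name to the stages it depends on, the set of changed stages is given as input, and the names `stages_to_reproduce` and `featurize.dvc` are hypothetical.

```python
def stages_to_reproduce(stages, changed):
    """Return the stages that need to be executed, in dependency order.

    stages:  dict mapping stage name -> list of stages it depends on
    changed: set of stages with modified dependencies or missing outputs
    """
    order = []
    visited = set()

    def visit(name):
        # Depth-first walk so that every stage appears after its dependencies.
        if name in visited:
            return
        visited.add(name)
        for dep in stages[name]:
            visit(dep)
        order.append(name)

    for name in stages:
        visit(name)

    dirty = set()
    result = []
    for name in order:
        # A stage runs if it changed itself or if any upstream stage will run.
        if name in changed or any(dep in dirty for dep in stages[name]):
            dirty.add(name)
            result.append(name)
    return result


if __name__ == "__main__":
    pipeline = {
        "prepare.dvc": [],
        "featurize.dvc": ["prepare.dvc"],
        "train.dvc": ["featurize.dvc"],
    }
    print(stages_to_reproduce(pipeline, {"featurize.dvc"}))
```

With nothing in `changed`, the result is empty, which corresponds to the "nothing to reproduce" case mentioned earlier in the hunk.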
diff --git a/static/docs/tutorial/define-ml-pipeline.md b/static/docs/tutorial/define-ml-pipeline.md
index 0463e49234..822e150c08 100644
--- a/static/docs/tutorial/define-ml-pipeline.md
+++ b/static/docs/tutorial/define-ml-pipeline.md
@@ -66,7 +66,9 @@ If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by
`dvc add`, you will see that only outputs are defined in `outs`. In this file,
only one output is defined. The output contains the data file path in the
repository and md5 checksum. This checksum determines a location of the actual
-content file in the cache directory, `.dvc/cache`.
+content file in the
+[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory),
+`.dvc/cache`.
```dvc
$ cat data/Posts.xml.zip.dvc
@@ -81,12 +83,12 @@ $ du -sh .dvc/cache/ec/*
```
> Outputs from DVC-files define the relationship between the data file path in a
-> repository and the path in a cache directory.
+> repository and the path in the cache directory.
-Keeping actual file content in a cache directory and a copy of the caches in the
-user workspace during `$ git checkout` is a regular trick that
-[Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses. This
-trick works fine for tracking small files with source code. For large data
+Keeping actual file contents in the cache, and a copy of the cached
+file in the workspace during `$ git checkout` is a regular trick
+that [Git-LFS](https://git-lfs.github.com/) (Git for Large File Storage) uses.
+This trick works fine for tracking small files with source code. For large data
files, this might not be the best approach, because of _checkout_ operation for
a 10Gb data file might take several seconds and a 50GB file checkout (think
copy) might take a few minutes.
@@ -186,12 +188,12 @@ command. `-d data/Posts.xml.zip` defines the input file and `-o data/Posts.xml`
the resulting extracted data file.
The `unzip` command extracts data file `data/Posts.xml.zip` to a regular file
-`data/Posts.xml`. It knows nothing about data files or DVC. DVC runs the command
-and does some additional work if the command was successful:
+`data/Posts.xml`. It knows nothing about data files or DVC. DVC executes the
+command and does some additional work if the command was successful:
1. DVC transforms all the outputs `-o` files into data files. It is like
applying `dvc add` for each of the outputs. As a result, all the actual data
- files content goes to the cache directory `.dvc/cache` and each
+ files content goes to the cache directory `.dvc/cache` and each
of the file names will be added to `.gitignore`.
2. For reproducibility purposes, `dvc run` creates the `Posts.xml.dvc` stage
@@ -266,7 +268,7 @@ A single stage of our ML pipeline was defined and committed into repository. It
isn't necessary to commit stages right after their creation. You can create a
few and commit them to Git together later.
-Let's run the following stages: converting an XML file to TSV, and then
+Let's create the following stages: converting an XML file to TSV, and then
separating training and testing datasets:
```dvc
@@ -398,4 +400,4 @@ focus is DVC, not ML modeling and we use a relatively small dataset without any
advanced ML techniques.
In the next chapter we will try to improve the metrics by changing our modeling
-code and using reproducibility in our pipeline regeneration.
+code and using reproducibility in our pipeline.
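Reviewer note: the first hunk of this file says the md5 checksum determines the location of the content file in the cache directory. A minimal sketch of that mapping, assuming the layout suggested by example paths in this diff such as `.dvc/cache/a3/04afb96060aad90176268345e10355` (the first two hex characters become a subdirectory, the rest the file name); the function name `cache_path` is hypothetical and not a DVC API.

```python
def cache_path(md5, cache_dir=".dvc/cache"):
    """Build the cache location for a checksum.

    Assumes the two-character prefix layout seen in the example paths;
    this is an inference from those paths, not a documented guarantee.
    """
    if len(md5) != 32:
        raise ValueError("expected a 32-character md5 hex digest")
    # First two characters name the subdirectory, the remaining 30 the file.
    return f"{cache_dir}/{md5[:2]}/{md5[2:]}"


if __name__ == "__main__":
    print(cache_path("a304afb96060aad90176268345e10355"))
```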
diff --git a/static/docs/tutorial/index.md b/static/docs/tutorial/index.md
index 0b3722f7c1..eb80b091f6 100644
--- a/static/docs/tutorial/index.md
+++ b/static/docs/tutorial/index.md
@@ -25,7 +25,7 @@ and this approach will not require storing binary files in your Git repository.
## DVC Workflow
-The diagram below describes all the DVC commands and relationships between local
-cache and remote storage.
+The diagram below describes all the DVC commands and relationships between a
+local cache and remote storage.

diff --git a/static/docs/tutorial/preparation.md b/static/docs/tutorial/preparation.md
index 8f0c9f5711..394c763448 100644
--- a/static/docs/tutorial/preparation.md
+++ b/static/docs/tutorial/preparation.md
@@ -68,7 +68,7 @@ DVC works on top of Git repositories. You run DVC initialization in a repository
directory to create DVC meta files and directories.
After DVC initialization, a new directory `.dvc/` will be created with `config`
-and `.gitignore` files, and cache directory. These files and
+and `.gitignore` files, and cache directory. These files and
directories are hidden from the user generally and are not meant to be
manipulated directly. However, we describe some DVC internals below for a better
understanding of how it works.
diff --git a/static/docs/tutorial/reproducibility.md b/static/docs/tutorial/reproducibility.md
index b60fe34830..8a07fac689 100644
--- a/static/docs/tutorial/reproducibility.md
+++ b/static/docs/tutorial/reproducibility.md
@@ -10,14 +10,14 @@ The most exciting part of DVC is reproducibility.
DVC tracks all the dependencies, which helps you iterate on ML models faster
without thinking what was affected by your last change.
-> In order to track all the dependencies, DVC finds and reads ALL the DVC-files
-> in a repository and builds a dependency graph
-> ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)) based on these
-> files.
+In order to track all the dependencies, DVC finds and reads all the DVC-files in
+a repository and builds a dependency graph
+([pipeline](/doc/commands-reference/pipeline)) based on these files.
This is one of the differences between DVC reproducibility and traditional
Makefile-like build automation tools (Make, Maven, Ant, Rakefile etc). It was
-designed in such a way to localize specification of DAG nodes.
+designed in such a way to localize specification of the graph nodes (pipeline
+[stages](/doc/commands-reference/run)).
If you run `repro` on any [DVC-file](/doc/user-guide/dvc-file-format) from our
repository, nothing happens because nothing was changed in the pipeline defined
@@ -86,7 +86,7 @@ Reproducing 'Dvcfile':
The process started with the feature creation stage because one of its
parameters was changed - the edited source code file `code/featurization.py`.
-All dependent stages were ran again as well.
+All dependent stages were executed as well.
Let's take a look at the metric's change. The improvement is close to zero
(+0.0075% to be precise):
@@ -181,8 +181,7 @@ clf = RandomForestClassifier(n_estimators=700,
n_jobs=6, random_state=seed)
```
-Only the modeling and the evaluation stage need to be reproduced. Just run
-repro:
+Only the modeling and the evaluation stage need to be reproduced. Just run:
```dvc
$ dvc repro
diff --git a/static/docs/tutorial/sharing-data.md b/static/docs/tutorial/sharing-data.md
index 0b1403004e..1f8f72670f 100644
--- a/static/docs/tutorial/sharing-data.md
+++ b/static/docs/tutorial/sharing-data.md
@@ -7,13 +7,13 @@ repositories. These repositories will contain all the information needed for
reproducibility and it might be a good idea to share these DVC-repositories
using GitHub or other Git services.
-DVC is able to push the cache to a cloud.
+DVC is able to push the cache to cloud storage.
-> Using your shared cache a colleague can reuse ML models that were trained on
-> your machine.
+> Using shared cloud storage, a colleague can reuse ML models that were trained
+> on your machine.
-First, you need to set a data remote which will be stored in the config file of
-the project. This can be done using the CLI as shown below.
+First, you need to set up remote storage, which will be recorded in the config
+file of the project. This can be done using the CLI as shown below.
> Note that we are using the `dvc-public` S3 bucket as an example and you don't
> have write access to it, so in order to follow the tutorial you will need to
@@ -28,7 +28,7 @@ $ git status -s
M .dvc/config
```
-Then, a simple command pushes files from your local cache to the cloud:
+Then, a simple command pushes files from your cache to the cloud:
```dvc
$ dvc push
diff --git a/static/docs/understanding-dvc/collaboration-issues.md b/static/docs/understanding-dvc/collaboration-issues.md
index cd316ca06a..c6f71c8045 100644
--- a/static/docs/understanding-dvc/collaboration-issues.md
+++ b/static/docs/understanding-dvc/collaboration-issues.md
@@ -29,8 +29,8 @@ principled way:
- How do you recover a model from last week without wasting time waiting for the
model to retrain?
-- How do you quickly switch between the large data source and a small data
- subset without modifying source code?
+- How do you quickly switch between the large dataset and a small subset without
+ modifying source code?
4. Reproducibility.
diff --git a/static/docs/understanding-dvc/core-features.md b/static/docs/understanding-dvc/core-features.md
index eb49eaa9b1..79dbeb1e54 100644
--- a/static/docs/understanding-dvc/core-features.md
+++ b/static/docs/understanding-dvc/core-features.md
@@ -4,10 +4,11 @@
interface and Git workflow.
2. It makes data science projects **reproducible** by creating lightweight
- pipelines of DAGs.
+ [pipelines](/doc/commands-reference/pipeline) using implicit dependency
+ graphs.
3. **Large data file versioning** works by creating pointers in your Git
- repository to the data cache on a local hard drive.
+ repository to the cache, typically stored on a local hard drive.
4. **Programming language agnostic**: Python, R, Julia, shell scripts, etc. ML
library agnostic: Keras, Tensorflow, PyTorch, scipy, etc.
diff --git a/static/docs/understanding-dvc/existing-tools.md b/static/docs/understanding-dvc/existing-tools.md
index 9d9b9eec97..370024af2d 100644
--- a/static/docs/understanding-dvc/existing-tools.md
+++ b/static/docs/understanding-dvc/existing-tools.md
@@ -6,7 +6,7 @@ There is one common opinion regarding data science tooling. Data scientists as
engineers are supposed to use the best practices and collaboration software from
software engineering. Source code version control system (Git), continuous
integration services (CI), and unit test frameworks are all expected to be
-utilized in data science pipelines.
+utilized in data science [pipelines](/doc/commands-reference/pipeline).
But a comprehensive look at data science processes shows that the software
engineering toolset does not cover data science needs. Try to answer all the
diff --git a/static/docs/understanding-dvc/how-it-works.md b/static/docs/understanding-dvc/how-it-works.md
index a5f6dc6980..039dee4fca 100644
--- a/static/docs/understanding-dvc/how-it-works.md
+++ b/static/docs/understanding-dvc/how-it-works.md
@@ -48,8 +48,8 @@
```dvc
$ git checkout a03_normbatch_vgg16 # checkout code and DVC-files
- $ dvc checkout # checkout data files from the local cache (not Git)
- $ ls -l data/ # These LARGE files were copied from DVC cache, not from Git
+ $ dvc checkout # checkout data files from the cache
+ $ ls -l data/ # These LARGE files came from the cache, not from Git
total 1017488
-r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
@@ -72,17 +72,17 @@
Rscript plot.R result.csv plots.jpg
```
-7. DVC's local cache can be transferred to your colleagues and partners through
- AWS S3, Azure Blob Storage or GCP Storage:
+7. The cache of a DVC project can be shared with your colleagues and partners
+ through AWS S3, Azure Blob Storage, GCP Storage, among others:
```dvc
$ git push
- $ dvc push # push the data cache to the remote storage
+ $ dvc push # push from the cache to remote storage
# On a colleague machine:
$ git clone https://github.com/dataversioncontrol/myrepo.git
$ cd myrepo
- $ git pull # get the data cache from cloud
+ $ dvc pull # download tracked data from remote storage
$ dvc checkout # checkout data files
$ ls -l data/ # You just got gigabytes of data through Git and DVC:
diff --git a/static/docs/understanding-dvc/related-technologies.md b/static/docs/understanding-dvc/related-technologies.md
index 3e9035fa7d..01bfe42d8e 100644
--- a/static/docs/understanding-dvc/related-technologies.md
+++ b/static/docs/understanding-dvc/related-technologies.md
@@ -13,8 +13,10 @@ process.
should NOT be stored in a Git repository but still need to be tracked and
versioned.
-2. **Workflow management tools** (pipelines and DAGs): Airflow, Luigi, etc. The
- differences are:
+2. **Workflow management tools** ([pipelines](/doc/commands-reference/pipeline)
+ and dependency graphs
+ ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph))): Airflow,
+ Luigi, etc. The differences are:
- DVC is focused on data science and modeling. As a result, DVC pipelines are
lightweight, easy to create and modify. However, DVC lacks pipeline execution
@@ -34,10 +36,10 @@ process.
- DVC doesn't need to run any services. No graphical user interface as a result,
but we expect some GUI services will be created on top of DVC.
-- DVC has transparent design:
- [meta files and directories](/doc/user-guide/dvc-files-and-directories)
- (including the data cache) have a human-readable format and can
- be easily reused by external tools.
+- DVC has a transparent design. Its
+ [internal files and directories](/doc/user-guide/dvc-files-and-directories)
+ (including the cache directory) have a human-readable format and
+ can be easily reused by external tools.
4. **Git workflows** and Git usage methodologies such as Gitflow. The
differences are:
@@ -51,18 +53,22 @@ process.
5. **Makefile** (and it's analogues). The differences are:
-- DVC utilizes a DAG:
+- DVC utilizes a
+ [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph)
+ (DAG):
- - The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with
- file names `.dvc` or `Dvcfile`).
+ - The DAG or dependency graph is defined by the connections between
+ [DVC-files](/doc/user-guide/dvc-file-format) (with file names `.dvc` or
+ `Dvcfile`), based on their dependencies and outputs.
- - One DVC-file defines one node in the DAG. All DVC-files in a repository make
- up a single pipeline (think a single Makefile). All DVC-files (and
+ - Each DVC-file defines one node in the DAG. All DVC-files in a repository
+ make up a single pipeline (think a single Makefile). All DVC-files (and
corresponding pipeline commands) are implicitly combined through their
inputs and outputs, to simplify conflict resolving during merges.
- - DVC provides a simple command `dvc run CMD` to generate a DVC-file
- automatically based on the provided command, dependencies, and outputs.
+ - DVC provides a simple command `dvc run` to generate a DVC-file or "stage
+ file" automatically, based on the provided command, dependencies, and
+ outputs.
- File tracking:
@@ -88,11 +94,12 @@ process.
- Git-annex is a datafile-centric system whereas DVC is focused on providing a
workflow for machine learning and reproducible experiments. When a DVC or
- Git-annex repository is cloned via git clone, data files won't be copied to
- the local machine as file content is stored in separate data remotes. However,
+ Git-annex repository is cloned via `git clone`, data files won't be copied to
+ the local machine as file contents are stored in separate
+ [remotes](/doc/commands-reference/remote). With DVC,
[DVC-files](/doc/user-guide/dvc-file-format) (which provide the reproducible
- workflow) are always included in the cloned Git repository and hence can be
- recreated locally with minimal effort.
+ workflow) are always included in the Git repository and hence can be recreated
+ locally with minimal effort.
- DVC is not fundamentally bound to Git, having the option of changing the
repository format.
diff --git a/static/docs/understanding-dvc/what-is-dvc.md b/static/docs/understanding-dvc/what-is-dvc.md
index b62c41c6a6..2cb6607401 100644
--- a/static/docs/understanding-dvc/what-is-dvc.md
+++ b/static/docs/understanding-dvc/what-is-dvc.md
@@ -22,8 +22,8 @@ DVC uses a few core concepts:
features, change model hyperparameters, data cleaning, add a new data source)
should be performed in a separate branch and then merged into the master
branch only if the experiment is successful. DVC allows experiments to be
- integrated into a project's history and NEVER needs to recompute the results
- after a successful merge.
+ integrated into a Git repository's history and NEVER needs to recompute the
+ results after a successful merge.
- **Experiment state** or state: Equivalent to a Git snapshot (all committed
files). Git checksum, branch name, or tag can be used as a reference to a
@@ -33,9 +33,11 @@ DVC uses a few core concepts:
generates output files based on a set of input files and source code. This
action usually changes experiment state.
-- **Pipeline**: Directed acyclic graph (DAG) or chain of commands to reproduce
- an experiment state. The commands are connected by input and output files.
- Pipelines are defined by special **DVC-files** (which act like Makefiles).
+- **Pipeline**: Dependency graph or series of commands to reproduce data
+ processing results. The commands are connected by input and output files
+ (dependencies). Pipelines are defined by special
+ [stage files](/doc/commands-reference/run) (similar to Makefiles). Refer to
+ [pipeline](/doc/commands-reference/pipeline) for more information.
- **Workflow**: Set of experiments and relationships among them. Workflow
corresponds to the entire Git repository.
@@ -45,8 +47,8 @@ DVC uses a few core concepts:
[DVC-files](/doc/user-guide/dvc-file-format) describing that data are stored
in Git for DVC needs (to maintain pipelines and reproducibility).
-- **Data cache**: Directory with all data files on a local hard drive or in
- cloud storage, but not in the Git repository.
+- **Cache directory**: Directory with all data files on a local hard drive or in
+ cloud storage, but not in the Git repository. See `dvc cache dir`.
- **Cloud storage** support: available complement to the core DVC features. This
is how a data scientist transfers large data files or shares a GPU-trained
diff --git a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md
index 85efd5c08a..634a455785 100644
--- a/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md
+++ b/static/docs/use-cases/multiple-data-scientists-on-a-single-machine.md
@@ -30,11 +30,11 @@ permissions.
### Transfer existing cache (Optional)
-This step is optional. You can skip it if you are setting up a new DVC
-repository and don't have your local cache stored in `.dvc/cache`. If you did
-work on your project with DVC previously and you wish to transfer your cache to
-the shared cache directory (external to your workspace), you will need to simply
-move it from an old cache location to the new one:
+This step is optional. You can skip it if you are setting up a new DVC project
+whose cache directory is not stored in the default location, `.dvc/cache`. If
+you did work on your project with DVC previously and you wish to transfer your
+cache to the shared cache directory (external to your workspace), you will need
+to simply move it from an old cache location to the new one:
```dvc
$ mv .dvc/cache/* /path/to/dvc-cache
diff --git a/static/docs/user-guide/analytics.md b/static/docs/user-guide/analytics.md
index 7c91447e57..a15888f431 100644
--- a/static/docs/user-guide/analytics.md
+++ b/static/docs/user-guide/analytics.md
@@ -49,7 +49,7 @@ DVC's analytics are sent throughout DVC's proxy to Google Analytics over HTTPS.
## Opting out
-DVC analytics helps the entire community and leaving it on is appreciated.
+DVC analytics help the entire community, so leaving it on is appreciated.
However, if you want to opt out of DVC's analytics, you can disable it via
`dvc config` command:
diff --git a/static/docs/user-guide/contributing-documentation.md b/static/docs/user-guide/contributing-documentation.md
index 0466d99f16..207836a574 100644
--- a/static/docs/user-guide/contributing-documentation.md
+++ b/static/docs/user-guide/contributing-documentation.md
@@ -64,9 +64,10 @@ $ git clone git@github.com:/dvc.org.git
```
Make sure you have the latest version of [Node.js](https://nodejs.org/en/) and
-[yarn](https://yarnpkg.com/en/) installed. Install and keep the dependencies up
-to date by running `yarn` often. This will also enable the Git pre-commit hook
-that will be formatting your code and documentation files automatically.
+[yarn](https://yarnpkg.com/en/) installed. Install the dependencies by running
+`yarn`. (Run it again whenever the repository changes to keep the dependencies
+up to date.) This also enables the Git pre-commit hook that formats your code
+and documentation files automatically.
It's highly recommended to run the Node docs app locally to check documentation
changes before submitting them, and its very much needed in order to make
@@ -88,7 +89,8 @@ command before committing them.
Visual Studio Code and the
[Rewrap](https://marketplace.visualstudio.com/items?itemName=stkb.rewrap)
plugin. Correct formatting will be done automatically by a Git pre-commit hook
- which is integrated when `yarn` runs in the instructions above.
+ which is integrated when `yarn` installs the project dependencies (explained
+ in the instructions above).
- We use [Prettier](https://prettier.io/) default conventions to format our
source code files. The formatting of staged files will automatically be done
diff --git a/static/docs/user-guide/contributing.md b/static/docs/user-guide/contributing.md
index 5837a45247..6e20eef32e 100644
--- a/static/docs/user-guide/contributing.md
+++ b/static/docs/user-guide/contributing.md
@@ -141,23 +141,24 @@ $ pip install -e ".[ssh]"
$ pip install -e ".[all]"
```
-You will need to update your `ENV` throughout subsequent steps, so we created a
-template for you:
+You will need to update your environment throughout subsequent steps, so we
+created a template for you:
```dvc
$ cp tests/remotes_env.sample tests/remotes_env
```
Then uncomment lines you need and fill in needed values, the details are
-explained in remote specific subsections. To activate these env vars use:
+explained in remote specific subsections. To activate these environment
+variables, use:
```dvc
$ source tests/remotes_env
```
-If some member of your team had already went through all of this you may just
-ask for their `remotes_env` file and Google Cloud credentials and you can skip
-any manipulations with `ENV` below.
+If another member of your team has already gone through this guide, you could
+just ask for their `remotes_env` file and Google Cloud credentials, and skip any
+env manipulations below.
@@ -167,8 +168,8 @@ Install
[aws cli](https://docs.aws.amazon.com/en_us/cli/latest/userguide/cli-chap-install.html)
tools.
-Set up an account, get credentials, which will have access to S3. Then, set
-`ENV` vars like this:
+Set up an account and get credentials that have access to S3. Then, set
+environment variables like this:
```dvc
$ export AWS_ACCESS_KEY_ID="...YOUR-ACCESS-KEY-ID..."
@@ -188,7 +189,7 @@ authenticated with your google account.
You then need to create a bucket, a service account and get its credentials. You
can do this via web UI or terminal. Then you need to put your keys to
-`scripts/ci/gcp-creds.json` and add these to your `ENV`:
+`scripts/ci/gcp-creds.json` and add these to your env vars:
```dvc
$ export GOOGLE_APPLICATION_CREDENTIALS=".gcp-creds.json"
@@ -231,7 +232,7 @@ $ mkdir azurite
$ azurite -s -l azurite -d azurite/debug.log
```
-Add this to your `ENV`:
+Add this to your environment variables:
```dvc
$ export AZURE_STORAGE_CONTAINER_NAME="dvc-test"
diff --git a/static/docs/user-guide/dvc-file-format.md b/static/docs/user-guide/dvc-file-format.md
index 1f488525fb..ba60116aec 100644
--- a/static/docs/user-guide/dvc-file-format.md
+++ b/static/docs/user-guide/dvc-file-format.md
@@ -45,7 +45,7 @@ meta: # Special key to contain arbitary user data
On the top level, `.dvc` file consists of these fields:
-- `cmd`: Command that is being run in this stage
+- `cmd`: Executable command defined in this stage
- `deps`: List of dependencies for this stage
- `outs`: List of outputs for this stage
- `md5`: md5 checksum for this DVC-file
diff --git a/static/docs/user-guide/dvc-files-and-directories.md b/static/docs/user-guide/dvc-files-and-directories.md
index b7199d24ff..10944dc336 100644
--- a/static/docs/user-guide/dvc-files-and-directories.md
+++ b/static/docs/user-guide/dvc-files-and-directories.md
@@ -40,13 +40,13 @@ directory (`.dvc/`) with special internal files and directories:
- `.dvc/updater.lock`: Lock file for `.dvc/updater`
-- `.dvc/lock`: Lock file for the whole DVC project
+- `.dvc/lock`: Lock file for the entire DVC project
## Structure of cache directory
There are two ways in which the data is stored in cache. It depends
-on if the actual data is stored in a file (eg. `data.csv`) or it is a directory
-of files.
+on whether the actual data is stored in a single file (e.g. `data.csv`) or in a
+directory of files.
We evaluate a checksum, usually MD5, for the data file which is a 32 characters
long string. The first two characters are assigned to name the directory inside
@@ -108,3 +108,5 @@ $ cat .dvc/cache/19/6a322c107c2572335158503c64bfba.dir
{"md5": "29a6c8271c0c8fbf75d3b97aecee589f", "relpath": "index.jpeg"}
]
```
+
+See also `dvc cache dir` to set the location of the cache directory.
diff --git a/static/docs/user-guide/external-dependencies.md b/static/docs/user-guide/external-dependencies.md
index 5d3f392f6e..387cab9ccc 100644
--- a/static/docs/user-guide/external-dependencies.md
+++ b/static/docs/user-guide/external-dependencies.md
@@ -1,11 +1,12 @@
# External Dependencies
-There are cases when data is large enough or processing is organized in a way
-that you would like to avoid moving data out of the remote storage. For example,
-you are processing data on HDFS, running Dask via SSH, or have a script that
-streams data from S3 to process it, etc. A mechanism of external dependencies
-and [External Outputs](/doc/user-guide/external-outputs) provides a way for DVC
-to control data externally.
+There are cases when data is so large, or its processing is organized in such a
+way, that you would like to avoid moving it out of its external/remote location.
+For example, you may be reading data from a network attached storage (NAS)
+drive, processing data on HDFS, running [Dask](https://dask.org/) via SSH, or
+streaming data from S3 to process it. A mechanism for external dependencies and
+[external outputs](/doc/user-guide/external-outputs) provides a way for DVC to
+control data externally.
## Description
diff --git a/static/docs/user-guide/external-outputs.md b/static/docs/user-guide/external-outputs.md
index a9a0c03654..3463db1e09 100644
--- a/static/docs/user-guide/external-outputs.md
+++ b/static/docs/user-guide/external-outputs.md
@@ -1,10 +1,11 @@
# Managing External Data
-There are cases when data is large enough or processing is organized in a way
-that you would like to avoid moving data out of the remote storage. For example,
-you are processing data on HDFS, running Dask via SSH, or have a script that
-streams data from S3 to process it, etc. A mechanism of external outputs and
-[External Dependencies](/doc/user-guide/external-dependencies) provides a way
+There are cases when data is so large, or its processing is organized in such a
+way, that you would like to avoid moving it out of its external/remote location.
+For example, you may be reading data from a network attached storage (NAS)
+drive, processing data on HDFS, running [Dask](https://dask.org/) via SSH, or
+streaming data from S3 to process it. A mechanism for external outputs and
+[external dependencies](/doc/user-guide/external-dependencies) provides a way
for DVC to control data externally.
## Description
@@ -31,7 +32,8 @@ pointing to your desired files. For cached external outputs (specified using
`-o`) you will need to
[setup an external cache](/doc/commands-reference/config#cache) location that
will be used by DVC to store versions of your external file. Non-cached external
-outputs (specified using `-O`) do not require external cache to be setup.
+outputs (specified using `-O`) do not require an external cache to
+be set up.
> Avoid using the same remote location that you are using for `dvc push`,
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
@@ -50,8 +52,8 @@ stage file (DVC-file).
### Local
-Your local cache location already defaults to `.dvc/cache`, so there is no need
-to specify it explicitly.
+The default local cache location is `.dvc/cache`, so there is no need to specify
+it explicitly.
```dvc
$ dvc add /home/shared/mydata
@@ -72,7 +74,7 @@ $ dvc config cache.s3 s3cache
# Add data on S3 directly
$ dvc add s3://mybucket/mydata
-# Run the stage with external S3 output
+# Create the stage with external S3 output
$ dvc run -d data.txt \
-o s3://mybucket/data.txt \
aws s3 cp data.txt s3://mybucket/data.txt
@@ -90,7 +92,7 @@ $ dvc config cache.gs gscache
# Add data on GS directly
$ dvc add gs://mybucket/mydata
-# Run the stage with external GS output
+# Create the stage with external GS output
$ dvc run -d data.txt \
-o gs://mybucket/data.txt \
gsutil cp data.txt gs://mybucket/data.txt
@@ -108,7 +110,7 @@ $ dvc config cache.ssh sshcache
# Add data on SSH directly
$ dvc add ssh://user@example.com:/mydata
-# Run the stage with external SSH output
+# Create the stage with external SSH output
$ dvc run -d data.txt \
-o ssh://user@example.com:/home/shared/data.txt \
scp data.txt user@example.com:/home/shared/data.txt
@@ -126,7 +128,7 @@ $ dvc config cache.hdfs hdfscache
# Add data on HDFS directly
$ dvc add hdfs://user@example.com/mydata
-# Run the stage with external HDFS output
+# Create the stage with external HDFS output
$ dvc run -d data.txt \
-o hdfs://user@example.com/home/shared/data.txt \
hdfs fs -copyFromLocal \
diff --git a/static/docs/user-guide/large-dataset-optimization.md b/static/docs/user-guide/large-dataset-optimization.md
index 70a7dd42ad..97aa6f5bbc 100644
--- a/static/docs/user-guide/large-dataset-optimization.md
+++ b/static/docs/user-guide/large-dataset-optimization.md
@@ -1,10 +1,11 @@
# Large Dataset Optimization
In order to track the data files and directories added with `dvc add` or
-`dvc run`, DVC moves all these files to a special cache directory.
-The DVC cache is a hidden storage (by default located in `.dvc/cache`) for files
-that are under DVC control, and their different versions. (See `dvc cache` and
-[DVC internal files](/doc/user-guide/dvc-files-and-directories) for more
+`dvc run`, DVC moves all these files to a special cache. A
+DVC project's cache is the hidden storage (by default located in
+`.dvc/cache`) for files that are under DVC control, and their different
+versions. (See `dvc cache` and
+[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) for more
details.)
However, the versions of the tracked files that
@@ -37,11 +38,11 @@ Symbolic links, and Reflinks in more recent systems. While reflinks bring all
the benefits and none of the worries, they're not commonly supported in most
platforms yet. Hard/soft links optimize **speed** and **space** in the file
system, but may break your workflow since updating hard/sym-linked files tracked
-by DVC in the workspace causes cache corruption. These 2 link types thus require
-using cache **protected mode** (see the `cache.protected` config option in
-`dvc config cache`). Finally, a 4th "linking" option is to actually copy files
-from/to the cache, which is safe but inefficient, especially for large files
-(several GBs or more data).
+by DVC in the workspace causes cache corruption. These 2 link types
+thus require using cache **protected mode** (see the `cache.protected` config
+option in `dvc config cache`). Finally, a 4th "linking" option is to actually
+copy files from/to the cache, which is safe but inefficient, especially for
+large files (several GBs or more data).
> Some versions of Windows (e.g. Windows Server 2012+ and Windows 10 Enterprise)
> support hard or soft links on the
diff --git a/static/docs/user-guide/update-tracked-file.md b/static/docs/user-guide/update-tracked-file.md
index 09b63a9de6..10ca5ebc9e 100644
--- a/static/docs/user-guide/update-tracked-file.md
+++ b/static/docs/user-guide/update-tracked-file.md
@@ -16,7 +16,7 @@ may mean either replacing `train.tsv` with a new file having the same name or
editing the content of the file.
If you run `dvc repro` there is no need to manage generated (output) files
-manually, DVC removes them for you before running the stage which generates
+manually, DVC removes them for you before executing the stage which generates
them.
If you use DVC to track a file that is generated during your pipeline (e.g. some