Merged
28 commits
ff3e9d6
term: change `ENV` for env(ironment) in contributing user guide
jorgeorpinel Sep 1, 2019
d86008a
remove: update help output
jorgeorpinel Sep 1, 2019
c38e6ce
get/import: add link and clarify HEAD `--rev` option default
jorgeorpinel Sep 1, 2019
07fb5b0
changelog: remove extra space in changelog/0.35
jorgeorpinel Sep 1, 2019
5bb7ede
get-started: reformat add-files
jorgeorpinel Sep 1, 2019
8ec0d54
docs: review usage of "DVC" branding of terms (1)
jorgeorpinel Sep 2, 2019
0413387
term: "remote cache" -> "remote storage"
jorgeorpinel Sep 2, 2019
182ff1b
term: review usage of "DVC" branding (3) through static/docs/commands…
jorgeorpinel Sep 2, 2019
26d3692
term: most "local cache" -> "cache directory" / "project cache"
jorgeorpinel Sep 2, 2019
fa93646
term: data set -> dataset
jorgeorpinel Sep 2, 2019
e91df63
term: "run(s)/ran again" -> "regenerate" (repro context)
jorgeorpinel Sep 2, 2019
903cea3
Merge branch 'master' into jorgeorpinel
jorgeorpinel Sep 2, 2019
b022c85
term: review usage of "dependency graph" (and related), "DAG", and
jorgeorpinel Sep 3, 2019
1665812
cmd ref: update "Data and pipelines are up to date." phrase
jorgeorpinel Sep 3, 2019
492cfc6
term: improve usage of "regenreate" and "execute" for stages/pipeline…
jorgeorpinel Sep 3, 2019
d81791d
term: reduse usage of "again", especially in the contest of `dvc repro`
jorgeorpinel Sep 3, 2019
65fbec3
glossary: update "workspace" term, and improve related user-guide des…
jorgeorpinel Sep 4, 2019
84c42fe
Merge branch 'master' into jorgeorpinel
jorgeorpinel Sep 4, 2019
3f7884b
term: stop using glossary entry "cache directory", related updates
jorgeorpinel Sep 5, 2019
3c0db9f
user-guide: link "cache directory" term where appropriate
jorgeorpinel Sep 5, 2019
5fe82e7
cmd ref: change from HEAD to "tip of default branch" in --rev option …
jorgeorpinel Sep 5, 2019
a5019ef
get-started: reword stage file commands explanation
jorgeorpinel Sep 5, 2019
46aa961
cmd ref: fix closing `)` in run and hyphenate "non-deterministic" in …
jorgeorpinel Sep 5, 2019
8753d05
cmd ref: explain outputs better in `add`
jorgeorpinel Sep 5, 2019
e4ce024
comlpemenet last commit
jorgeorpinel Sep 5, 2019
62742ed
term: review DVC branding up to static/docs/commands-reference/metrics
jorgeorpinel Sep 5, 2019
f9ab91a
term: review "runs" throughout
jorgeorpinel Sep 5, 2019
03eaa8b
term: review usage of "data remote" and include "remote storage" more
jorgeorpinel Sep 5, 2019
5 changes: 4 additions & 1 deletion src/Documentation/glossary.js
@@ -12,6 +12,9 @@ Directory containing all your project files. For example raw datasets, source
code, ML models, etc. A workspace becomes a **DVC project** when
[\`dvc init\`](/doc/commands-reference/init) is run, and
[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it.

Note that [external outputs](/doc/user-guide/external-outputs) also form part
of your expanded workspace, technically.
`
},
{
@@ -26,7 +29,7 @@ Initialized by running \`dvc init\` in the **workspace**. It will contain the
},
{
name: 'DVC Cache',
match: ['DVC cache', 'cache', 'cache directory', 'data cache', 'cached'],
match: ['DVC cache', 'cache', 'cached'],
desc: `
The DVC cache is a hidden storage (by default located in the \`.dvc/cache\`
directory) for files that are under DVC control, and their different versions.
2 changes: 1 addition & 1 deletion static/docs/changelog/0.18.md
@@ -17,7 +17,7 @@ really excited to share the progress with you:
- ⚡ **DVC just got faster**

- Data files management commands like `dvc add`, `dvc push`, `dvc pull`, etc.
got up to 10x faster on data sets with large number of files.
got up to 10x faster on datasets with a large number of files.

- Commands startup latency reduced 3x

8 changes: 4 additions & 4 deletions static/docs/changelog/0.35.md
@@ -14,7 +14,7 @@ improvements) we have done in the last few months:

- 📖 The [Get Started](/doc/get-started/agenda) section has been simplified
(e.g. to use tags instead of branches) and extended. We have also prepared a
[Github DVC project ](https://github.com/iterative/example-get-started) that
[DVC project on Github](https://github.com/iterative/example-get-started) that
reflects the sequence of chapters in the “get started” guide. You can now
download the whole project and reproduce all the models.

@@ -41,8 +41,8 @@ improvements) we have done in the last few months:

- We’ve introduced the DVC commit command and `dvc run/repro/add --no-commit`
flag to give a way to **avoid uncontrolled cache growth** and as a way to save
some `dvc repro` runs. In the future we plan to have “do-not-cache-my-data” as
a default mode for `dvc run`, `dvc add` and `dvc repro`.
some runs of `dvc repro`. In the future we plan to have “do-not-cache-my-data”
as a default mode for `dvc run`, `dvc add` and `dvc repro`.

- **SSH remotes (data storage) support** - config options to set port, key
files, timeouts, password, etc + improved stability and Windows support!
@@ -63,7 +63,7 @@ improvements) we have done in the last few months:
general user experience for the commands that navigate tags or branches (all
the commands that include `--all-branches`, `-a` or `--all-tags`, `-T`).

There are new [DVC integrations and plugins](/doc/user-guide/plugins) available:
There are new [integrations and plugins](/doc/user-guide/plugins) available:

- Finally there is an official
[Bash and Zsh completion](/doc/user-guide/autocomplete) for DVC!
82 changes: 44 additions & 38 deletions static/docs/commands-reference/add.md
@@ -1,7 +1,7 @@
# add

Take a data file or a directory under DVC control (by creating a corresponding
DVC-file).
[DVC-file](/doc/user-guide/dvc-file-format)).

## Synopsis

@@ -15,37 +15,43 @@ positional arguments:

## Description

The `dvc add` command is analogous to the `git add` command. By default an added
file is committed to the DVC cache. Using the `--no-commit` option, the file
will not be added to the cache and instead the `dvc commit` command is used when
(or if) the file is to be committed to the DVC cache.
The `dvc add` command is analogous to the `git add` command. By default though,
an added file or directory is also committed to the <abbr>cache</abbr>. (Use the
`--no-commit` option to avoid this, and `dvc commit` as a separate step when
ready.)

Under the hood, a few actions are taken for each file in `targets`:
The `targets` are files or directories to be placed under DVC control. These are
turned into outputs (`outs` field) in a resulting
[DVC-file](/doc/user-guide/dvc-file-format). (See steps below for more details.)
Note that target data outside the current <abbr>workspace</abbr> is supported,
which becomes [external outputs](/doc/user-guide/external-outputs).

Under the hood, a few actions are taken for each file (or directory) in
`targets`:

1. Calculate the file checksum.
2. Move the file content to the DVC cache (default location is `.dvc/cache`).
3. Replace the file by a link to the file in the cache (see details below).
4. Create a corresponding [DVC-file](/doc/user-guide/dvc-file-format) and store
the MD5 checksum to identify the cache entry.
5. Add the targets to `.gitignore` (if Git is used in this
<abbr>workspace</abbr>) to prevent it from being committed to the Git
2. Move the file contents to the cache directory (by default in `.dvc/cache`),
using the checksum to form the cached file name.
3. Replace the file by a link to the file in cache (see details below).
4. Create a corresponding DVC-file and store the checksum to identify the cached
file. Unless the `-f` option is used, the DVC-file name generated by default
is `<file>.dvc`, where `<file>` is the file name of the first target.
5. Unless `dvc init --no-scm` was used when initializing the project, add the
`targets` to `.gitignore` in order to prevent them from being committed to
the Git repository.
6. Unless `dvc init --no-scm` was used when initializing the project,
instructions are printed showing `git` commands for adding the files to a Git
repository.
6. Instructions are printed showing `git` commands for adding the files to a Git
repository. If a different SCM system is being used, use the equivalent
command for that system or nothing is printed if `--no-scm` was specified for
the repository.

Unless the `-f` options is used, by default the DVC-file name generated is
`<file>.dvc`, where `<file>` is file name of the first output (from `targets`).

The result is data file is added to the DVC cache, and DVC-files can be tracked
via Git or other version control system. The DVC-file lists the added file as an
output (`out`), and references the DVC cache entry using the checksum. See
The result is that the target data gets cached by DVC, and instead small
DVC-files can be tracked with Git. The DVC-file lists the added file as an
output (`outs` field), and references the cached file using the checksum. See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more details.
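
As a rough illustration of steps 1 and 2 above, the following Python sketch computes an MD5 checksum and derives a cached file name from it. It is not DVC's actual code: the helper names `file_md5` and `cache_path` are hypothetical, and the layout (first two hex characters as a subdirectory) is an assumption about how the checksum forms the cached file name.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, checksum):
    """Build a cached file path from a checksum.

    Assumed layout (hypothetical helper): the first two hex characters
    name a subdirectory and the remaining characters name the file.
    """
    return os.path.join(cache_dir, checksum[:2], checksum[2:])
```

For example, `cache_path(".dvc/cache", file_md5("data.xml"))` would give the location where the contents of a hypothetical `data.xml` would be stored under these assumptions.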

> Note that DVC-files created by this command are _orphans_: they have no
> dependencies. _Orphan_ "stage files" are always considered _changed_ by
> `dvc repro`, which always executes them.
> Note that DVC-files created by this command are considered _orphans_ because
> they have no dependencies, only outputs. These _orphan_ "stage files" are
> always treated as _changed_ by `dvc repro`, which always executes them. See
> `dvc run` to learn about regular stage files.

By default DVC tries to use reflinks (see
[File link types](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
@@ -59,14 +65,14 @@ to work with directory hierarchies with `dvc add`.

1. With `dvc add --recursive`, the hierarchy is traversed and every file is
added individually as described above. This means every file has its own
DVC-file, and a corresponding DVC cache entry is made (unless `--no-commit`
flag is added).
DVC-file, and a corresponding cached file is created (unless the
`--no-commit` flag is used).
2. When not using `--recursive` a DVC-file is created for the top of the
directory (with default name `dirname.dvc`). Every file in the hierarchy is
added to the DVC cache (unless `--no-commit` flag is added), but DVC does not
added to the cache (unless `--no-commit` flag is added), but DVC does not
produce individual DVC-files for each file in the directory tree. Instead,
the single DVC-file points to a file in the DVC cache that contains
references to the files in the added hierarchy.
the single DVC-file points to a file in the cache that contains references to
the files in the added hierarchy.

In a <abbr>DVC project</abbr>, `dvc add` can be used to version control any
<abbr>data artifact</abbr> (input, intermediate, or output files and
@@ -84,10 +90,10 @@ and make your project reproducible.
found, a new DVC-file is created using the process described in this command's
description.

- `--no-commit` - do not put files/directories into cache. A DVC-file is
created, and an entry is added to `.dvc/state`, while nothing is added to the
cache. Use `dvc commit` when you are ready to save your results to cache. This
is analogous to using `git add` before `git commit`.
- `--no-commit` - do not save outputs to cache. A DVC-file is created, and an
entry is added to `.dvc/state`, while nothing is added to the cache. This is
analogous to using `git add` before `git commit`. Use `dvc commit` when ready
to commit the results to cache.

> The `dvc status` command will mention that the file is `not in cache`.

@@ -194,9 +200,9 @@ Saving information to 'pics.dvc'.
```

There are no DVC-files generated within this directory structure, but the images
are all added to the DVC cache. DVC prints a message to that effect, saying that
`md5` values are computed for each directory. A DVC-file is generated for the
top-level directory, and it contains this:
are all added to the <abbr>cache</abbr>. DVC prints a message to that effect,
saying that `md5` values are computed for each directory. A DVC-file is
generated for the top-level directory, and it contains this:

```yaml
md5: df06d8d51e6483ed5a74d3979f8fe42e
@@ -225,7 +231,7 @@ top-level DVC-file is generated. But this is less convenient.

With the `dvc add pics` a single DVC-file is generated, `pics.dvc`, which lets
us treat the entire directory structure in one unit. It lets you pass the whole
directory tree as a dependency to a `dvc run` stage like so:
directory tree as a dependency to a `dvc run` stage definition, like this:

```dvc
$ dvc run -f train.dvc \
12 changes: 7 additions & 5 deletions static/docs/commands-reference/cache/dir.md
@@ -1,6 +1,6 @@
# cache dir

Set/unset the <abbr>cache directory</abbr> location intuitively (compared to
Set/unset the <abbr>cache</abbr> directory location intuitively (compared to
using `dvc config cache`).

## Synopsis
@@ -16,10 +16,12 @@ positional arguments:

## Description

Helper to set the `cache.dir` configuration option. Unlike doing so with
`dvc config cache`, this command transform paths (`value`) that are provided
relative to the current working directory into paths **relative to the config
file location**. They are required in the latter form for the config file.
Helper to set the `cache.dir` configuration option. (See
[cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).)
Unlike doing so with `dvc config cache`, this command transforms paths (`value`)
that are provided relative to the current working directory into paths
**relative to the config file location**. They are required in the latter form
for the config file.
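
The path transformation described above could look roughly like this in Python. This is only a sketch under assumptions: `to_config_relative` is a hypothetical helper, not a DVC function, and leaving absolute paths unchanged is assumed rather than stated in the text.

```python
import os


def to_config_relative(value, cwd, config_dir):
    """Rewrite a path given relative to cwd as relative to config_dir.

    Hypothetical helper: absolute paths are returned unchanged, which is
    an assumption made for this sketch.
    """
    if os.path.isabs(value):
        return value
    absolute = os.path.normpath(os.path.join(cwd, value))
    return os.path.relpath(absolute, config_dir)
```

With a config file in `/proj/.dvc`, a value of `../cache` entered from `/proj/data` refers to `/proj/cache`, so the sketch returns `../cache` relative to the config file location.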

## Options

12 changes: 6 additions & 6 deletions static/docs/commands-reference/cache/index.md
@@ -1,6 +1,6 @@
# cache

Contains a helper command to set the <abbr>cache directory</abbr> location:
Contains a helper command to set the <abbr>cache</abbr> directory location:
[dir](/doc/commands-reference/cache/dir).

## Synopsis
@@ -15,12 +15,12 @@ positional arguments:

## Description

After DVC initialization, a hidden directory `.dvc/` is created with the
[DVC internal files](/doc/user-guide/dvc-files-and-directories), including the
default cache directory.
After DVC initialization, a hidden directory `.dvc/` is created to contain the
[DVC files and directories](/doc/user-guide/dvc-files-and-directories),
including the default cache directory.

The DVC cache is where your data files, models, etc (anything you want to
version with DVC) are actually stored. The corresponding files you see in the
The cache is where your data files, models, etc (anything you want to version
with DVC) are actually stored. The corresponding files you see in the
<abbr>workspace</abbr> simply link to the ones in cache. (See
`dvc config cache`, `type` config option, for more information on file links on
different platforms.)
28 changes: 14 additions & 14 deletions static/docs/commands-reference/checkout.md
@@ -16,14 +16,15 @@ positional arguments:

## Description

[DVC-files](/doc/user-guide/dvc-file-format) in a <abbr>DVC project</abbr>
specify which instance of each data file or directory is to be used, using the
checksum saved in the `outs` fields. The `dvc checkout` command updates the
workspace data to match with the cache files corresponding to those checksums.
[DVC-files](/doc/user-guide/dvc-file-format) in a <abbr>project</abbr> specify
which instance of each data file or directory is to be used, using the checksum
saved in the `outs` fields. The `dvc checkout` command updates the workspace
data to match with the <abbr>cached</abbr> files corresponding to those
checksums.

Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM repository, the DVC-files will contain checksums for
the corresponding data files kept in the DVC cache. After an SCM command like
the corresponding data files kept in the cache. After an SCM command like
`git checkout` is run, the DVC-files will change to the state at the specified
branch or commit or tag. Afterwards, the `dvc checkout` command is required in
order to synchronize the data files with the currently checked out DVC-files.
@@ -64,8 +65,8 @@ restoring any file size will be almost instantaneous.
> `cache.slow_link_warning` config option to `false` with `dvc config cache`.

The output of `dvc checkout` does not list which data files were restored. It
does report removed files and files that DVC was unable to restore due to it
missing from the cache.
does report removed files and files that DVC was unable to restore because
they're missing from the <abbr>cache</abbr>.

This command will fail to checkout files that are missing from the cache. In
such a case, `dvc checkout` prints a warning message. Any files that can be
@@ -74,7 +75,7 @@ checked out without error will be restored.
There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
regenerate its outputs. (See also `dvc pipeline`.) In other cases the cache can
be pulled from a remote cache using `dvc pull`.
be pulled from remote storage using `dvc pull`.
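
The behavior described above (restore what is available, report what is missing) can be sketched in Python as follows. This is an illustration, not DVC's implementation: `checkout_outputs` is a hypothetical function, the cache layout is assumed, and files are copied here for simplicity where DVC would normally link them.

```python
import os
import shutil


def checkout_outputs(outs, cache_dir, workspace):
    """Restore outputs from the cache; return (restored, missing) paths.

    `outs` is a list of (relative_path, checksum) pairs. The cache layout
    (two-character subdirectory) is an assumption of this sketch.
    """
    restored, missing = [], []
    for rel_path, checksum in outs:
        cached = os.path.join(cache_dir, checksum[:2], checksum[2:])
        if not os.path.isfile(cached):
            # Missing from cache: would need `dvc repro` or `dvc pull`.
            missing.append(rel_path)
            continue
        target = os.path.join(workspace, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(cached, target)
        restored.append(rel_path)
    return restored, missing
```

Any path returned in `missing` corresponds to a file that would have to be regenerated with `dvc repro` or downloaded with `dvc pull`.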

## Options

@@ -90,10 +91,9 @@ be pulled from a remote cache using `dvc pull`.
inspect.

- `-f`, `--force` - does not prompt when removing workspace files. Changing the
current set of DVC-files with SCM commands like `git checkout` can result in
the need for DVC to remove files which should not exist in the current state
and are missing in the local cache (they are not committed in DVC terms). This
option controls whether the user will be asked to confirm these files removal.
current set of DVC-files with `git checkout` can result in the need for DVC to
remove files that don't match those DVC-file references or are missing from
cache. (They are not "committed", in DVC terms.)

- `-h`, `--help` - shows the help message and exit.

@@ -205,8 +205,8 @@ MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43
```

What happened is that DVC went through the sole existing DVC-file and adjusted
the current set of files to match the `outs` of that stage. `dvc fetch` command
runs once to download missing data from the remote storage to the local cache.
the current set of files to match the `outs` of that stage. `dvc fetch` is run
once to download missing data from the remote storage to the <abbr>cache</abbr>.
Alternatively, we could have just run `dvc pull` in this case to automatically
do `dvc fetch` + `dvc checkout`.
