Merged
32 commits
3f1217e
guide: update Ext Data guide link to add to-cache/remote examples
jorgeorpinel Mar 13, 2021
bab95a9
ref: config options copy edits
jorgeorpinel Mar 14, 2021
69cbbb6
ref: destroy copy edit
jorgeorpinel Mar 14, 2021
a1e609e
Merge branch 'master' into jorge +
jorgeorpinel Mar 15, 2021
cd599c8
ref: fix mac config file locs
jorgeorpinel Mar 15, 2021
eb4af97
ref: small update to plots * --open
jorgeorpinel Mar 15, 2021
043db23
ref: clarify and correct info on add/import to-cache/remote strategies
jorgeorpinel Mar 15, 2021
fa82c89
ref: import-url vs import in terms of remote sync
jorgeorpinel Mar 15, 2021
5729f49
ref: roll back changes unrelated to get/import from this PR
jorgeorpinel Mar 15, 2021
c393212
ref: remove wrong info about import* to-cache
jorgeorpinel Mar 15, 2021
b020b4e
Update content/docs/command-reference/add.md
jorgeorpinel Mar 15, 2021
86813ad
Restyled by prettier
restyled-commits Mar 15, 2021
725aa92
Merge pull request #2303 from iterative/restyled/jorge
jorgeorpinel Mar 15, 2021
f2350ab
ref: import + push/pull notes
jorgeorpinel Mar 15, 2021
14b62cc
ref: simplify add -o
jorgeorpinel Mar 15, 2021
d63b07f
ref: update add --to-remote desc
jorgeorpinel Mar 15, 2021
94010d7
ref: simplify add -o example intro
jorgeorpinel Mar 15, 2021
562b63c
ref: mention soft/hard links in add -o example
jorgeorpinel Mar 15, 2021
f694719
ref: external data cop edits
jorgeorpinel Mar 15, 2021
d5e793e
ref: avoid term "transfer" for -o/-to-remote (1)
jorgeorpinel Mar 15, 2021
0166b1f
ref: relink to add/import -o/-to-remote examples including
jorgeorpinel Mar 16, 2021
696fa53
ref: updated add/import to-cache/remote example titles
jorgeorpinel Mar 16, 2021
15097f1
ref: a couple more copy edits to add -o/-to-remote
jorgeorpinel Mar 16, 2021
31394de
ref: update --to-remote copy edits
jorgeorpinel Mar 16, 2021
0f2a2b1
ref: roll back changes not related to #2302
jorgeorpinel Mar 16, 2021
3d10596
Merge branch 'master' into jorge
jorgeorpinel Mar 17, 2021
1082c85
ref: clarfy --out option
jorgeorpinel Mar 17, 2021
b943df5
ref: rename add -o/-to-remote examples
jorgeorpinel Mar 17, 2021
b25da5c
ref: other copy edits to add -o/-to-remote
jorgeorpinel Mar 17, 2021
aa171d4
ref: no hard links for add -o + ext cache
jorgeorpinel Mar 17, 2021
61f8806
ref: more edits to add/import-url to-cache/remote
jorgeorpinel Mar 17, 2021
d5f284a
Merge branch 'master' into jorge
jorgeorpinel Mar 18, 2021
159 changes: 75 additions & 84 deletions content/docs/command-reference/add.md
@@ -33,23 +33,23 @@ option to avoid this, and `dvc commit` to finish the process when needed).
> See also `dvc.yaml` and `dvc run` for more advanced ways to track and version
> intermediate and final results (like ML models).

After checking that each `target` hasn't been added before (or tracked with
other DVC commands), a few actions are taken under the hood:
After checking that each `target` isn't already tracked with DVC, a few actions
are taken under the hood:

1. Calculate the file hash.
2. Move the file contents to the cache (by default in `.dvc/cache`) (or to
remote storage if `--to-remote` is given), using the file hash to form the
cached file path. (See
2. Move the file contents to the cache, using the file hash to form the cached
file path (see
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
for more details.)
3. Attempt to replace the file with a link to the cached data (more details on
file linking further down). Skipped if `--to-remote` is used.
4. Create a corresponding `.dvc` file to track the file, using its path and hash
to identify the cached data (with `--to-remote`/`-o`, an external path is
moved to the workspace). The `.dvc` file lists the DVC-tracked file as an
<abbr>output</abbr> (`outs` field). Unless the `--file` option is used, the
`.dvc` file name generated by default is `<file>.dvc`, where `<file>` is the
file name of the first target.
for details). Using `--out`, or `--to-remote` with an external target, the
data is copied instead (to cache or remote storage).
3. Attempt to replace the file with a link to (or copy of) the cached data (more
details on file linking ahead). A new link is created if a different `--out`
`path` is given. Skipped if `--to-remote` is used.
4. Create a `.dvc` file to track the file or directory, saving its path, and
the hash as a pointer to the cached data. The `.dvc` file lists the data as
an <abbr>output</abbr> (`outs` field). Unless the `--file` option is used,
the `.dvc` file name generated by default is `<file>.dvc`, where `<file>` is
the file name of the first target.
5. Add the `targets` to `.gitignore` in order to prevent them from being
committed to the Git repository (unless `dvc init --no-scm` was used when
initializing the <abbr>DVC project</abbr>).
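
As a rough illustration of steps 1 to 3 above, here is a minimal Python sketch. It is not DVC's implementation: the helper names (`file_md5`, `cache_path`, `sketch_add`) are hypothetical, the two-character directory prefix is an assumption based on the cache structure document linked above, and a symlink with a copy fallback stands in for DVC's configurable link types.

```python
import hashlib
import os
import shutil
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Step 1: hex MD5 digest of the file contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Step 2 (assumed layout): <cache_dir>/<first 2 hex chars>/<remaining chars>."""
    return Path(cache_dir) / md5_hex[:2] / md5_hex[2:]


def sketch_add(target, cache_dir=".dvc/cache"):
    """Hypothetical helper mimicking steps 1-3; not DVC's actual code."""
    target = Path(target)
    cached = cache_path(file_md5(target), cache_dir)
    cached.parent.mkdir(parents=True, exist_ok=True)
    # Step 2: move the file contents into the cache.
    shutil.move(str(target), str(cached))
    try:
        # Step 3: replace the workspace file with a link to the cached data.
        os.symlink(cached.resolve(), target)
    except OSError:
        # Fall back to a plain copy if links are not available.
        shutil.copy2(cached, target)
    return cached
```

DVC's real hashing and linking logic may differ in its details (for example in how directories are handled), so treat this only as a mental model of the steps listed above.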
@@ -145,28 +145,32 @@ not.
[pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`

- `--external` - allow `targets` that are outside of the DVC repository. See
- `-o <path>`, `--out <path>` - destination `path` inside the workspace to place
the data target. By default the data file basename is used in the current
working directory (if this option isn't used). Directories in the given `path`
will be created. Note that for external targets, this can be combined
[with an external cache](#example-external-data) to skip the local file
system.

- `--to-remote` - allow a target outside of the DVC repository (e.g. an S3
object, SSH directory URL, file on mounted volume, etc.) but don't move it
into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a DVC
remote instead (the default one unless `-r` is specified) to skip the local
file system. Use `dvc pull` to get the data later.

- `-r <name>`, `--remote <name>` - name of the
[remote](/doc/command-reference/remote) to store data on (can only be used
with `--to-remote`).

- `--external` - allow `targets` that are outside of the DVC repository, to
track in-place. See
[Managing External Data](/doc/user-guide/managing-external-data).

> ⚠️ Note that this is an advanced feature for very specific situations and
> not recommended except if there's absolutely no other alternative.
> Additionally, this typically requires an external cache setup (see link
> above).

- `-o <path>`, `--out <path>` - destination `path` to make a local target copy,
or to [transfer](#example-transfer-to-cache) an external target into the cache
(and link to workspace). Note that this can be combined with `--to-remote` to
avoid storing the data locally, while still adding it to the project.

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage)
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) to transfer external target to
(can only be used with `--to-remote`).

- `--desc <text>` - user description of the data (optional). This doesn't affect
any DVC operations.

@@ -336,95 +340,82 @@ $ tree .dvc/cache
Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.

## Example: Transfer to the cache

When you have a large dataset in an external location, you may want to add it to
the <abbr>project</abbr> without having to copy it into the workspace. Maybe
your local disk doesn't have enough space, but you have set up an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
that could handle it.
## Example: External data

The `--out` option lets you add external paths in a way that they are
<abbr>cached</abbr> first, and then
[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
to a given path inside the <abbr>workspace</abbr>. Let's initialize an example
DVC project to try this:
Sometimes you may want to add a large dataset currently found in an external
location. But what if there's not enough disk space to download the data? Here's
one method!

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
```
The `--out` option lets you add external data so that it's linked to a given path
inside the <abbr>workspace</abbr> after being copied to the <abbr>cache</abbr>.
Combined with
[symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache),
this lets you avoid using the local file system completely.

Now we can add a `data.xml` file via HTTP for example, putting it in a local path
in our project:
For example, we can add a `data.xml` file via HTTP, outputting it to a local
path in our project:

```dvc
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
$ dvc add https://data.dvc.org/get-started/data.xml -o raw/data.xml

$ ls
data.xml data.xml.dvc
```

The resulting `.dvc` file will save the provided local `path` as if the data was
already in the workspace, while the `md5` hash points to the copy of the data
that has now been transferred to the <abbr>cache</abbr>. Let's check the
contents of `data.xml.dvc` in this case:
The local `data.xml` should be a symlink to the (externally) <abbr>cached</abbr>
data copy. The resulting `.dvc` file will save the local `path` as if the data
was already there before this command. Let's check the contents of
`data.xml.dvc`:

```yaml
outs:
- md5: a304afb96060aad90176268345e10355
nfiles: 1
path: data.xml
path: raw/data.xml
```

> For a similar operation that actually keeps a connection to the data source,
> please see `dvc import-url`.
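
The default naming described for `--out` and for the generated `.dvc` file can be sketched in a few lines of Python. This is only an illustration of the documented defaults, not DVC code; the helper names `default_out` and `default_dvc_file` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def default_out(url):
    """Basename of the URL path, as used when --out is not given."""
    return posixpath.basename(urlparse(url).path)


def default_dvc_file(out):
    """Default .dvc file name: the output's file name plus ".dvc"."""
    return posixpath.basename(out) + ".dvc"


url = "https://data.dvc.org/get-started/data.xml"
print(default_out(url))                    # data.xml
print(default_dvc_file(default_out(url)))  # data.xml.dvc
```

The sketch only derives file names; it makes no claim about which directory DVC writes the `.dvc` file to when `--out` points into a subdirectory.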

## Example: Transfer to remote storage
## Example: `--to-remote` usage {#straight-to-remote}

When you have a large dataset in an external location, you may want to track it
as if it was in your project, but without downloading it locally (for now). The
`--to-remote` option lets you do so, while storing a copy
[remotely](/doc/command-reference/remote) so it can be
[pulled](/doc/command-reference/pull) later. Let's initialize a DVC project,
and set up a remote:
Here's another method to add a large dataset found in an external location
without downloading the data (refer to previous example).

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
```
The `--to-remote` option lets you store a copy of the target data on a
[DVC remote](/doc/command-reference/remote), while creating a `.dvc` file
locally so it can be [pulled](/doc/command-reference/pull) later. This is a way
to "bootstrap" a project on your local machine, to be
[reproduced](/doc/command-reference/repro) on the right environment later (e.g.
a GPU cloud server or a CI/CD system).

Now let's add the `data.xml` to our remote storage from the given remote
location.
Let's set up a simple remote and add a `data.xml` file from the web this way:

```dvc
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
--to-remote -r myremote
...
```

The only difference is that the dataset is transferred straight to the remote,
so DVC won't control the external location you gave, but rather continue managing
your remote storage, where the data now resides. The operation still results in a
`.dvc` file:

```dvc
$ ls
data.xml.dvc
```

Whenever anyone wants to actually download the added data (for example from a
system that can handle it), they can use `dvc pull` as usual:
> Note that this can be combined with `--out` to specify a local destination
> `path` (written to the `.dvc` file).

```dvc
$ dvc pull data.xml.dvc -r myremote
DVC won't control the original data source after this, but rather continue
managing your remote storage, where the data is now found. To actually download
the data to <abbr>cache</abbr>, you can use `dvc fetch` or `dvc pull` as usual
(on a system that can handle it):

```dvc
$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
```

> Note that `dvc repro` will try to download the data too, as part of the
> pipeline execution.
8 changes: 4 additions & 4 deletions content/docs/command-reference/get.md
@@ -56,10 +56,10 @@ name.

## Options

- `-o <path>`, `--out <path>` - specify a path to the desired location in the
workspace to place the downloaded file or directory (instead of using the
current working directory). Directories specified in the path will be created
by this command.
- `-o <path>`, `--out <path>` - destination `path` to place the downloaded file
or directory. By default the data file basename is used in the current working
directory (if this option isn't used). Directories in the given `path` will be
created.

- `--rev <commit>` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
82 changes: 35 additions & 47 deletions content/docs/command-reference/import-url.md
@@ -23,9 +23,8 @@ positional arguments:
## Description

In some cases it's convenient to add a data file or directory from an external
location into the workspace (or to
[remote storage](/doc/command-reference/remote)), such that it can be updated
later, if/when the external data source changes. Example scenarios:
location into the project, such that it can be updated later if/when the
external data source changes. Example scenarios:

- A remote system may produce occasional data files that are used in other
projects.
@@ -37,22 +36,20 @@ later, if/when the external data source changes. Example scenarios:

`dvc import-url` helps you create such an external data dependency, without
having to manually copy files from the supported locations (listed below), which
may require installing a different tool for each type.

When you don't want to store the target data in your local system, you can still
create an import `.dvc` file while transferring a file or directory directly to
remote storage, by using the `--to-remote` option. See the
[Transfer to remote storage](#example-transfer-to-remote-storage) example for
more details.
would require installing/using a different tool for each type.

The `url` argument specifies the external location of the data to be imported.
The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
working directory with its original file name e.g. `data.txt` (or to a location
provided with `out`).
working directory with its original file name e.g. `data.txt`, or to a location
provided with `out`.

An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` –
similar to using `dvc add` after downloading the data. This makes it possible to
update the import later, if the data source has changed (see `dvc update`).
similar to using `dvc add` after downloading the data. It saves the information
about the data source, so the import can be updated later if the data source has
changed (see `dvc update`).

💡 The `--to-remote` option lets you store an import on a
[DVC remote](/doc/command-reference/remote) without using the local file system.

> Note that the imported data can be [pushed](/doc/command-reference/push) to
> remote storage normally.
@@ -64,8 +61,9 @@ field contains the corresponding local path in the <abbr>workspace</abbr>. It
records enough metadata about the imported data to enable DVC to efficiently
determine whether the local copy is out of date.

Note that `dvc repro` doesn't check or update import `.dvc` files, use
`dvc update` to bring the import up to date from the data source.
Note that `dvc repro` doesn't check or update import `.dvc` files by default
(see `dvc freeze`); use `dvc update` to bring the import up to date from the
data source.

DVC supports several types of external locations (protocols):

@@ -140,13 +138,13 @@ $ dvc run -n download_data \
want to "DVCfy" this state of the project (see also `dvc commit`).

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer](#example-import-straight-to-the-remote) it
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.
workspace, nor cache it. [Store a copy](#straight-to-remote) on a remote
instead (the default one unless `-r` is specified). Use `dvc pull` to get the
data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote storage](/doc/command-reference/remote) (can only be used with
`--to-remote`).
[remote](/doc/command-reference/remote) to store data on (can only be used
with `--to-remote`).

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
@@ -358,46 +356,36 @@ Running stage 'prepare' with command:
python src/prepare.py data/data.xml
```

## Example: Transfer to remote storage
## Example: `--to-remote` usage {#straight-to-remote}

When you have a large dataset in an external location, you may want to import it
to your project without downloading it to the local file system (for using it
later/elsewhere). The `--to-remote` option lets you skip the download, while
storing the imported data [remotely](/doc/command-reference/remote). Let's
initialize a DVC project, and set up a remote:
Normally, `dvc import-url` downloads the target data (to the <abbr>cache</abbr>)
in order to link and track it locally. But what if there's not enough disk
space?

```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
```
The `--to-remote` option lets you store a copy of the target data on a
[DVC remote](/doc/command-reference/remote), while creating an import `.dvc`
file locally so it can be [pulled](/doc/command-reference/pull) later. This is
a way to "bootstrap" an import on your local machine, to be downloaded on the
right environment later.

Now let's create an import `.dvc` file without downloading the target data,
transferring it directly to remote storage instead:
Let's set up a simple remote and add a `data.xml` file from the web this way:

```
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \
--to-remote -r myremote
...
```

The only change in our local <abbr>workspace</abbr> is a newly created import
`.dvc` file:

```dvc
$ ls
data.xml.dvc
```

Whenever anyone wants to actually download the imported data (for example from a
system that can handle it), they can use `dvc pull` as usual:
The only change in our local <abbr>workspace</abbr> is the tiny `.dvc` file that
was created. To actually download the data to <abbr>cache</abbr>, you can use
`dvc fetch` or `dvc pull` as usual (on a system that can handle it):

```
$ dvc pull data.xml.dvc -r myremote

$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
```