From 9d319356526ea682790e4afa42ec59c378005ee7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 28 Aug 2020 12:06:46 -0500 Subject: [PATCH 1/8] guide: update x data guide for 1.x and other improvements, as well as a note about not supporting x metrics/plots per https://discord.com/channels/485586884165107732/485596304961962003/748857709830340698 --- .../docs/user-guide/managing-external-data.md | 43 ++++++++++++------- 1 file changed, 28 insertions(+), 15 deletions(-) diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 2348ee977a..9cd8b9fd72 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -6,20 +6,19 @@ example from a network attached storage (NAS) drive, processing data on HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams data from S3 to process it. External outputs and [external dependencies](/doc/user-guide/external-dependencies) provide a way for -DVC to control data outside of the project directory. +DVC to track data outside of the project. ## Description -DVC can track files on an external storage with `dvc add` or specify external -files as outputs for -[DVC-files](/doc/user-guide/dvc-files-and-directories) created by `dvc run` -(stage files). External outputs are considered part of the DVC project. DVC will -track changes in them and reflect this in the output of `dvc status`. +DVC can track files on an external location with `dvc add` or specify external +files or directories as outputs for `dvc.yaml` files. External +outputs are considered part of the (extended) DVC project: DVC will track +changes in them, and reflect this in `dvc status` for example. 
Currently, the following types (protocols) of external outputs (and cache) are supported: -- Local files and directories outside of your workspace +- Local files and directories outside the workspace - SSH - Amazon S3 - Google Cloud Storage @@ -28,17 +27,20 @@ Currently, the following types (protocols) of external outputs (and > Note that these are a subset of the remote storage types supported by > `dvc remote`. -In order to specify an external output for a stage file, use the usual `-o` or -`-O` options of `dvc run`, but with the external path or URL to the file in -question. For cached external outputs (`-o`) you will need to +In order to specify an external output for a stage file, add them to the stage +in `dvc.yaml` normally (for example with the usual `-o` or `-O` options of +`dvc run`) but with the external path or URL to the file in question. For cached +external outputs (`-o`), you will need to [setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file system first. -> Avoid using the same location of the -> [remote storage](/doc/command-reference/remote) that you have for `dvc push` -> and `dvc pull` for external outputs or as external cache, because it may cause -> file hash overlaps: The hash value of a data file in external storage could -> collide with the one generated locally for another file. +Please note that there is no support for external metrics or plots (`-m`, `-p`, +etc. options of `dvc run`). + +> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for +> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file +> hash overlaps: the hash of an external output could collide with a hash +> generated locally for another file with different content. 
## Examples @@ -103,6 +105,11 @@ $ dvc run -d data.txt \ scp data.txt user@example.com:/data.txt ``` +> Please note that to use password authentication, it's necessary to set the +> `password` or `ask_password` SSH remote options first (see +> `dvc remote modify`), and use the special URL in: +> `dvc add --external remote://sshcache/mydata`. + ⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations. Please check that you are able to connect both ways with tools like `ssh` and `sftp` (GNU/Linux). @@ -138,6 +145,12 @@ it. So systems like Hadoop, Hive, and HBase are supported! The default cache location is `.dvc/cache`, so there is no need to move it for local paths outside of your project. +> Except for external data on different storage devices or partitions mounted on +> the same file system (e.g. `/mnt/raid/data`). In that case please setup an +> external cache in that same drive to enable +> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +> and avoid copying data. + ```dvc # Add data on an external location directly $ dvc add --external /home/shared/mydata From 74b624f19928d89a8f9ee621087bcf3cb2f28e58 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 28 Aug 2020 12:08:24 -0500 Subject: [PATCH 2/8] docs: updates to match x data guide update (see prev commit) --- content/docs/command-reference/config.md | 15 +++++++-------- content/docs/command-reference/run.md | 2 +- content/docs/user-guide/external-dependencies.md | 2 +- 3 files changed, 9 insertions(+), 10 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 411be369fb..ece3e83f37 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -156,14 +156,13 @@ for more details.) This section contains the following options: `dvc remote` for more information on "local remotes".) 
This will overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`. -- `cache.ssh` - name of an - [SSH remote to use as external cache](/doc/user-guide/managing-external-data#ssh). - - > Avoid using the same remote location that you are using for `dvc push`, - > `dvc pull`, `dvc fetch` as external cache for your external outputs, because - > it may cause possible file hash overlaps: the hash of a data file in - > external storage could collide with a hash generated locally for another - > file with a different content. +- `cache.ssh` - name of an SSH remote to use + [as external cache](/doc/user-guide/managing-external-data#ssh). + + > Avoid using the same [DVC remote](/doc/command-reference/remote) (used for + > `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file + > hash overlaps: the hash of an external output could collide + > with a hash generated locally for another file with different content. - `cache.s3` - name of an [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3). diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 7746a9cfdf..1c75275c68 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -99,7 +99,7 @@ Relevant notes: - [external dependencies](/doc/user-guide/external-dependencies) and [external outputs](/doc/user-guide/managing-external-data) (outside of the - workspace) are also supported. + workspace) are also supported (except metrics and plots). - Outputs are deleted from the workspace before executing the command (including at `dvc repro`) if their paths are found as existing files/directories. 
This diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index c23a486f12..06a78cdbdd 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -6,7 +6,7 @@ example from a network attached storage (NAS) drive, processing data on HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams data from S3 to process it. A mechanism for external dependencies and [external outputs](/doc/user-guide/managing-external-data) provides a way for -DVC to control data externally. +DVC to track data outside of the project. ## Description From d0f6c3c7f75e7fdaf93cb916c9ba0383e5740b65 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 28 Aug 2020 14:31:26 -0500 Subject: [PATCH 3/8] guide: more updates for x deps/outs pages per 1.x et al. --- .../docs/user-guide/external-dependencies.md | 18 ++++++++--------- .../docs/user-guide/managing-external-data.md | 20 +++++++++---------- 2 files changed, 19 insertions(+), 19 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 06a78cdbdd..b787451127 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -4,7 +4,7 @@ There are cases when data is so large, or its processing is organized in a way that you would like to avoid moving it out of its external/remote location. For example from a network attached storage (NAS) drive, processing data on HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams data -from S3 to process it. A mechanism for external dependencies and +from S3 to process it. A mechanism for external dependencies and [external outputs](/doc/user-guide/managing-external-data) provides a way for DVC to track data outside of the project. 
@@ -28,9 +28,9 @@ supported: > Note that these are a subset of the remote storage types supported by > `dvc remote`. -In order to specify an external dependency for your stage, use the usual `-d` -option in `dvc run` with the external path or URL to your desired file or -directory. +In order to specify an external dependency for your stage, use the +usual `-d` option in `dvc run` with the external path or URL to your desired +file or directory. ## Examples @@ -149,12 +149,12 @@ $ dvc import-url https://data.dvc.org/get-started/data.xml Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml' ``` -The command above creates the import stage (DVC-file) -`data.xml.dvc`, that uses an external dependency (in this case an HTTPs URL). +The command above creates the import `.dvc` file `data.xml.dvc`, that contains +an external dependency (in this case an HTTPs URL).
-### Expand to see resulting DVC-file +### Expand to see resulting `.dvc` file ```yaml # ... @@ -180,7 +180,7 @@ determine whether the source has changed and we need to download the file again. `dvc import` can download a data artifact from any DVC project or Git repository. It also creates an external dependency in its -import stage (DVC-file). +import `.dvc` file. ```dvc $ dvc import git@github.com:iterative/example-get-started model.pkl @@ -193,7 +193,7 @@ specified (with the `repo` field).
-### Expand to see resulting DVC-file +### Expand to see resulting `.dvc` file ```yaml # ... diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 9cd8b9fd72..7d905a55ea 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -4,16 +4,16 @@ There are cases when data is so large, or its processing is organized in a way that you would like to avoid moving it out of its external/remote location. For example from a network attached storage (NAS) drive, processing data on HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams data -from S3 to process it. External outputs and +from S3 to process it. External outputs and [external dependencies](/doc/user-guide/external-dependencies) provide a way for DVC to track data outside of the project. ## Description DVC can track files on an external location with `dvc add` or specify external -files or directories as outputs for `dvc.yaml` files. External -outputs are considered part of the (extended) DVC project: DVC will track -changes in them, and reflect this in `dvc status` for example. +files or directories as outputs for `dvc.yaml` files. External outputs are +considered part of the (extended) DVC project: DVC will track changes in them, +and reflect this in `dvc status` for example. Currently, the following types (protocols) of external outputs (and cache) are supported: @@ -27,10 +27,10 @@ Currently, the following types (protocols) of external outputs (and > Note that these are a subset of the remote storage types supported by > `dvc remote`. -In order to specify an external output for a stage file, add them to the stage -in `dvc.yaml` normally (for example with the usual `-o` or `-O` options of -`dvc run`) but with the external path or URL to the file in question. 
For cached -external outputs (`-o`), you will need to +In order to specify an external output for a stage file, add them +to the stage in `dvc.yaml` normally (for example with the usual `-o` or `-O` +options of `dvc run`) but with the external path or URL to the file in question. +For cached external outputs (`-o`), you will need to [setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file system first. @@ -45,8 +45,8 @@ etc. options of `dvc run`). ## Examples For the examples, let's take a look at a [stage](/doc/command-reference/run) -that simply moves local file to an external location, producing a `data.txt.dvc` -DVC-file. +that simply moves local file to an external location, producing the `.dvc` file +`data.txt.dvc`, that contains an external output. ### Amazon S3 From df8062355a1d5570677f9aa64f8eee6e15d0b566 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 28 Aug 2020 16:12:30 -0500 Subject: [PATCH 4/8] guide: more fixes for x data pages --- content/docs/user-guide/external-dependencies.md | 9 ++++----- content/docs/user-guide/managing-external-data.md | 4 ++-- 2 files changed, 6 insertions(+), 7 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index b787451127..34e0f1a585 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -17,13 +17,12 @@ stages. DVC will track changes in them and reflect this in the output of Currently, the following types (protocols) of external dependencies are supported: -- Local files and directories outside of your workspace -- SSH - Amazon S3 - Microsoft Azure Blob Storage - Google Cloud Storage +- SSH - HDFS -- HTTP +- Local files and directories outside the workspace > Note that these are a subset of the remote storage types supported by > `dvc remote`. 
@@ -154,7 +153,7 @@ an external dependency (in this case an HTTPs URL).
-### Expand to see resulting `.dvc` file +### Expand to see resulting .dvc file ```yaml # ... @@ -193,7 +192,7 @@ specified (with the `repo` field).
-### Expand to see resulting `.dvc` file +### Expand to see resulting .dvc file ```yaml # ... diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 7d905a55ea..943a56feb5 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -18,11 +18,11 @@ and reflect this in `dvc status` for example. Currently, the following types (protocols) of external outputs (and cache) are supported: -- Local files and directories outside the workspace -- SSH - Amazon S3 - Google Cloud Storage +- SSH - HDFS +- Local files and directories outside the workspace > Note that these are a subset of the remote storage types supported by > `dvc remote`. From 551df38504c6f075b6c118c9c89a37847a648d67 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 30 Aug 2020 00:23:22 -0500 Subject: [PATCH 5/8] guide: roll back accidental removal of HTTP x dep and reorder cache.{type} options to match std. remote order per https://github.com/iterative/dvc.org/pull/1735#pullrequestreview-478062572 et al. --- content/docs/command-reference/config.md | 18 +++++++++--------- .../docs/user-guide/external-dependencies.md | 1 + 2 files changed, 10 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index ece3e83f37..43388f70f8 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -156,6 +156,15 @@ for more details.) This section contains the following options: `dvc remote` for more information on "local remotes".) This will overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`. +- `cache.s3` - name of an + [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3). + +- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as + [external cache](/doc/user-guide/managing-external-data). 
+ +- `cache.gs` - name of a + [Google Cloud Storage remote to use as external cache](/doc/user-guide/managing-external-data#google-cloud-storage). + - `cache.ssh` - name of an SSH remote to use [as external cache](/doc/user-guide/managing-external-data#ssh). @@ -164,18 +173,9 @@ for more details.) This section contains the following options: > hash overlaps: the hash of an external output could collide > with a hash generated locally for another file with different content. -- `cache.s3` - name of an - [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3). - -- `cache.gs` - name of a - [Google Cloud Storage remote to use as external cache](/doc/user-guide/managing-external-data#google-cloud-storage). - - `cache.hdfs` - name of an [HDFS remote to use as external cache](/doc/user-guide/managing-external-data#hdfs). -- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as - [external cache](/doc/user-guide/managing-external-data). - ### state See diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 34e0f1a585..cd9789912d 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -22,6 +22,7 @@ supported: - Google Cloud Storage - SSH - HDFS +- HTTP - Local files and directories outside the workspace > Note that these are a subset of the remote storage types supported by From d6c68de1d06e4437cade5f0a9ea168129a1e7abe Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 30 Aug 2020 01:16:53 -0500 Subject: [PATCH 6/8] guide: std intro to both x docs and improve x outs desc and examples, including adding Azure example --- .../docs/user-guide/external-dependencies.md | 18 +-- .../docs/user-guide/managing-external-data.md | 107 +++++++++--------- 2 files changed, 62 insertions(+), 63 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md 
b/content/docs/user-guide/external-dependencies.md index cd9789912d..f08f25a1c5 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -1,16 +1,18 @@ # External Dependencies There are cases when data is so large, or its processing is organized in a way -that you would like to avoid moving it out of its external/remote location. For -example from a network attached storage (NAS) drive, processing data on HDFS, -running [Dask](https://dask.org/) via SSH, or having a script that streams data -from S3 to process it. A mechanism for external dependencies and -[external outputs](/doc/user-guide/managing-external-data) provides a way for -DVC to track data outside of the project. +such that you would like to avoid moving it out of its external/remote location. +For example from a network attached storage (NAS) drive, processing data on +HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams +data from S3 to process it. -## Description +External dependencies and +[external outputs](/doc/user-guide/managing-external-data) provide ways to track +data outside of the project. -With DVC, you can specify external files as dependencies for your pipeline +## How it works + +You can specify external files or directories as dependencies for your pipeline stages. DVC will track changes in them and reflect this in the output of `dvc status`. diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 943a56feb5..854798d2ab 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -1,24 +1,33 @@ # Managing External Data There are cases when data is so large, or its processing is organized in a way -that you would like to avoid moving it out of its external/remote location. 
For -example from a network attached storage (NAS) drive, processing data on HDFS, -running [Dask](https://dask.org/) via SSH, or having a script that streams data -from S3 to process it. External outputs and -[external dependencies](/doc/user-guide/external-dependencies) provide a way for -DVC to track data outside of the project. +such that its preferable to avoid moving it from its external/remote location. +For example data on a network attached storage (NAS) drive, processing data on +HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams +data from S3 to process it. -## Description +External outputs and +[external dependencies](/doc/user-guide/external-dependencies) provide ways to +track data outside of the project. -DVC can track files on an external location with `dvc add` or specify external -files or directories as outputs for `dvc.yaml` files. External outputs are -considered part of the (extended) DVC project: DVC will track changes in them, -and reflect this in `dvc status` for example. +## How external outputs work + +DVC can track existing files or directories on an external location with +`dvc add` (`out` field). It can also create external files or directories as +outputs for `dvc.yaml` files (only `outs` field, not metrics or plots). + +External outputs are considered part of the (extended) DVC project: DVC will +track changes in them, and reflect this in `dvc status` reports, for example. + +For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to +[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +in the same external/remote file system first. 
Currently, the following types (protocols) of external outputs (and cache) are supported: - Amazon S3 +- Microsoft Azure Blob Storage - Google Cloud Storage - SSH - HDFS @@ -27,16 +36,6 @@ Currently, the following types (protocols) of external outputs (and > Note that these are a subset of the remote storage types supported by > `dvc remote`. -In order to specify an external output for a stage file, add them -to the stage in `dvc.yaml` normally (for example with the usual `-o` or `-O` -options of `dvc run`) but with the external path or URL to the file in question. -For cached external outputs (`-o`), you will need to -[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -in the same external/remote file system first. - -Please note that there is no support for external metrics or plots (`-m`, `-p`, -etc. options of `dvc run`). - > Avoid using the same [DVC remote](/doc/command-reference/remote) (used for > `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file > hash overlaps: the hash of an external output could collide with a hash @@ -44,42 +43,52 @@ etc. options of `dvc run`). ## Examples -For the examples, let's take a look at a [stage](/doc/command-reference/run) -that simply moves local file to an external location, producing the `.dvc` file -`data.txt.dvc`, that contains an external output. +For the examples, let's take a look at + +1. Adding a `dvc remote` to use as cache for data in the external location, and + configure it as external cache with `dvc config`. +2. Tracking existing data on an external location with `dvc add` (this doesn't + download it). This produces a `.dvc` file with an external output. +3. Creating a simple [stage](/doc/command-reference/run) that moves a local file + to the external location. This produces a stage with another external output + in `dvc.yaml`. 
### Amazon S3 ```dvc -# Add S3 remote to be used as cache location for S3 files $ dvc remote add s3cache s3://mybucket/cache - -# Tell DVC to use the 's3cache' remote as S3 cache location $ dvc config cache.s3 s3cache -# Add data on S3 directly -$ dvc add --external s3://mybucket/mydata +$ dvc add --external s3://mybucket/existing-data -# Create the stage with an external S3 output $ dvc run -d data.txt \ --external \ -o s3://mybucket/data.txt \ aws s3 cp data.txt s3://mybucket/data.txt ``` +### Microsoft Azure Blob Storage + +```dvc +$ dvc remote add azurecache azure://mycontainer/cache +$ dvc config cache.azure azurecache + +$ dvc add --external azure://mycontainer/existing-data + +$ dvc run -d data.txt \ + --external \ + -o azure://mycontainer/data.txt \ + az storage blob upload -f data.txt -c mycontainer -n data.txt +``` + ### Google Cloud Storage ```dvc -# Add GS remote to be used as cache location for GS files $ dvc remote add gscache gs://mybucket/cache - -# Tell DVC to use the 'gscache' remote as GS cache location $ dvc config cache.gs gscache -# Add data on GS directly -$ dvc add --external gs://mybucket/mydata +$ dvc add --external gs://mybucket/existing-data -# Create the stage with an external GS output $ dvc run -d data.txt \ --external \ -o gs://mybucket/data.txt \ @@ -89,16 +98,11 @@ $ dvc run -d data.txt \ ### SSH ```dvc -# Add SSH remote to be used as cache location for SSH files $ dvc remote add sshcache ssh://user@example.com/cache - -# Tell DVC to use the 'sshcache' remote as SSH cache location $ dvc config cache.ssh sshcache -# Add data on SSH directly -$ dvc add --external ssh://user@example.com/mydata +$ dvc add --external ssh://user@example.com/existing-data -# Create the stage with an external SSH output $ dvc run -d data.txt \ --external \ -o ssh://user@example.com/data.txt \ @@ -107,8 +111,8 @@ $ dvc run -d data.txt \ > Please note that to use password authentication, it's necessary to set the > `password` or `ask_password` SSH remote 
options first (see -> `dvc remote modify`), and use the special URL in: -> `dvc add --external remote://sshcache/mydata`. +> `dvc remote modify`), and use a special `remote://` URL in step 2: +> `dvc add --external remote://sshcache/existing-data`. ⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations. Please check that you are able to connect both ways with tools like `ssh` and @@ -119,16 +123,11 @@ Please check that you are able to connect both ways with tools like `ssh` and ### HDFS ```dvc -# Add HDFS remote to be used as cache location for HDFS files $ dvc remote add hdfscache hdfs://user@example.com/cache - -# Tell DVC to use the 'hdfscache' remote as HDFS cache location $ dvc config cache.hdfs hdfscache -# Add data on HDFS directly -$ dvc add --external hdfs://user@example.com/mydata +$ dvc add --external hdfs://user@example.com/existing-data -# Create the stage with an external HDFS output $ dvc run -d data.txt \ --external \ -o hdfs://user@example.com/data.txt \ @@ -142,8 +141,8 @@ it. So systems like Hadoop, Hive, and HBase are supported! ### Local file system path -The default cache location is `.dvc/cache`, so there is no need to move it for -local paths outside of your project. +The default cache is in `.dvc/cache`, so there is no need to set a +custom cache location for local paths outside of your project. > Except for external data on different storage devices or partitions mounted on > the same file system (e.g. `/mnt/raid/data`). In that case please setup an @@ -152,10 +151,8 @@ local paths outside of your project. > and avoid copying data. 
```dvc -# Add data on an external location directly -$ dvc add --external /home/shared/mydata +$ dvc add --external /home/shared/existing-data -# Create the stage with an external location output $ dvc run -d data.txt \ --external \ -o /home/shared/data.txt \ From 8971e6cf0429fb295e476001e04976ff5e38ed21 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 31 Aug 2020 17:42:52 -0500 Subject: [PATCH 7/8] guide: put ".dvc" in back quotes per https://github.com/iterative/dvc.org/pull/1735#pullrequestreview-478981609 --- content/docs/user-guide/external-dependencies.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 3caf0daa94..fa7a3dbc3a 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -156,7 +156,7 @@ an external dependency (in this case an HTTPs URL).
-### Expand to see resulting .dvc file +### Expand to see resulting `.dvc` file ```yaml # ... @@ -195,7 +195,7 @@ specified (with the `repo` field).
-### Expand to see resulting .dvc file +### Expand to see resulting `.dvc` file ```yaml # ... From 23be15284715a83d0603924510ead4d1ec8f352e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 31 Aug 2020 21:03:58 -0500 Subject: [PATCH 8/8] guide: remove "drive" from "NAS" per https://github.com/iterative/dvc.org/pull/1735#pullrequestreview-479125255 --- content/docs/user-guide/external-dependencies.md | 6 +++--- content/docs/user-guide/managing-external-data.md | 6 +++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index fa7a3dbc3a..e3c25c3b06 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -2,9 +2,9 @@ There are cases when data is so large, or its processing is organized in a way such that you would like to avoid moving it out of its external/remote location. -For example from a network attached storage (NAS) drive, processing data on -HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams -data from S3 to process it. +For example from a network attached storage (NAS), processing data on HDFS, +running [Dask](https://dask.org/) via SSH, or having a script that streams data +from S3 to process it. External dependencies and [external outputs](/doc/user-guide/managing-external-data) provide ways to track diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 854798d2ab..649d73b12a 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -2,9 +2,9 @@ There are cases when data is so large, or its processing is organized in a way such that its preferable to avoid moving it from its external/remote location. 
-For example data on a network attached storage (NAS) drive, processing data on -HDFS, running [Dask](https://dask.org/) via SSH, or having a script that streams -data from S3 to process it. +For example data on a network attached storage (NAS), processing data on HDFS, +running [Dask](https://dask.org/) via SSH, or having a script that streams data +from S3 to process it. External outputs and [external dependencies](/doc/user-guide/external-dependencies) provide ways to