From b350213bf33487563c452beb3a6ccb984556192b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 12:09:02 -0500 Subject: [PATCH 01/15] guide: copy edits to DVC Dirs & Files --- .../user-guide/dvc-files-and-directories.md | 33 +++++++++---------- 1 file changed, 16 insertions(+), 17 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 3748d948ab..6d818cfedc 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -5,26 +5,24 @@ directory (`.dvc/`) with the [internal directories and files](#internal-directories-and-files) needed for DVC operation. -Additionally, there are a few special kind of files created by certain -[DVC commands](/doc/command-reference): +Additionally, there are a few special kinds of files that support DVC's +features: - Files ending with the `.dvc` extension are placeholders to track data files - and directories. A DVC project usually has one - [`.dvc` file](#dvc-files) per large data file or dataset directory being - tracked. -- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages - that form the pipeline(s) of a project, and their connections (_dependency - graph_ or DAG). + and directories. A DVC project usually has one `.dvc` file per + large data file or dataset directory being tracked. +- `dvc.yaml` files (or _pipelines files_) specify stages that form the + pipeline(s) of a project, and how they connect (_dependency graph_ or DAG). - These typically come with a matching `dvc.lock` file to record the pipeline - state and track its data artifacts. + These typically have a matching `dvc.lock` file to record the pipeline state + and track its data artifacts. Both `.dvc` files and `dvc.yaml` use human-friendly YAML schemas, described below. We encourage you to get familiar with them so you may create, generate, and edit them on your own. 
-All these should be versioned with Git (in Git-enabled -repositories). +Both the internal directory and these special files should be versioned with Git +(in Git-enabled repositories). ## .dvc files @@ -150,8 +148,9 @@ the possible following fields: (the file's location). - `deps`: List of dependency file or directory paths of this stage (relative to `wdir` which defaults to the file's location) -- `params`: List of parameter dependency keys (field names) that - are read from a YAML, JSON, or TOML file (`params.yaml` by default). +- `params`: List of [parameter dependencies](/doc/command-reference/params). + These are key paths referring to a YAML, JSON or TOML file (`params.yaml` by + default). - `outs`: List of output file or directory paths of this stage (relative to `wdir` which defaults to the file's location), and optionally, whether or not this file or directory is cached (`true` by @@ -213,15 +212,15 @@ stages: Stage commands are listed again in `dvc.lock`, in order to know when their definitions change in the `dvc.yaml` file. -Regular dependencies and all types of outputs +Regular dependencies and all kinds of outputs (including [metrics](/doc/command-reference/metrics) and [plots](/doc/command-reference/plots) files) are also listed (per stage) in `dvc.lock`, but with an additional field to store the hash value of each file or directory tracked by DVC. Specifically: `md5`, `etag`, or `checksum` (same as in `deps` and `outs` entries of [`.dvc` files](#dvc-files)). -Full parameters (key and value) are listed separately under -`params`, grouped by parameters file. +[Parameter](/doc/command-reference/params#examples) key/value pairs are listed +separately under `params`, grouped by parameters file. 
## Internal directories and files From 81bcdba8a715bb4929e29ab6d16d4d4a4180b902 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 12:09:49 -0500 Subject: [PATCH 02/15] start: add link to CML.dev from Data Pipelines --- content/docs/start/data-pipelines.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index efb7da34e4..d264153562 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -301,7 +301,8 @@ important problems: Storing these files in Git makes it easy to version and share. - _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing projects in way that it can be reproduced (built) is the fist necessary step - before introducing CI/CD systems. + before introducing CI/CD systems. See our sister project, + [CML](https://cml.dev/) for some examples. ## Visualize From 91a46dffd1be2bb19c1b06e4062822017aca211f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 12:17:15 -0500 Subject: [PATCH 03/15] cmd: update URL desc in the Synopsis of a few refs --- content/docs/command-reference/get-url.md | 5 ++--- content/docs/command-reference/import-url.md | 5 ++--- content/docs/command-reference/remote/add.md | 5 ++--- 3 files changed, 6 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md index 564e0963be..835eb18856 100644 --- a/content/docs/command-reference/get-url.md +++ b/content/docs/command-reference/get-url.md @@ -12,9 +12,8 @@ Download a file or directory from a supported URL (for example `s3://`, usage: dvc get-url [-h] [-q | -v] url [out] positional arguments: - url Location of the data to download. - See supported URLs below. - out Destination path to put files in + url (See supported URLs in the description.) + out Destination path to put files in. 
``` ## Description diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 012147ff1a..5a066c09d9 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -14,9 +14,8 @@ usage: dvc import-url [-h] [-q | -v] [--file ] [--no-exec] url [out] positional arguments: - url Location of the data to import. - See supported URLs below. - out Destination path to put files in + url (See supported URLs in the description.) + out Destination path to put files in. ``` ## Description diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 147155a8c6..2fb2e97e23 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -12,9 +12,8 @@ usage: dvc remote add [-h] [--global | --system | --local] [-q | -v] [-d] [-f] name url positional arguments: - name Name of the remote - url Remote location. - See full list of supported URLs below. + name Name of the remote. + url (See supported URLs in the examples below.) ``` ## Description From eef628389c1e6120fba2a91af978bab5fcdab2c6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 12:33:24 -0500 Subject: [PATCH 04/15] docs: std. 
s3:// URLs everywhere --- content/docs/command-reference/config.md | 2 +- content/docs/command-reference/get-url.md | 6 +++--- content/docs/command-reference/import-url.md | 8 ++++---- content/docs/command-reference/remote/add.md | 16 ++++++++-------- content/docs/command-reference/remote/index.md | 6 +++--- content/docs/command-reference/remote/modify.md | 16 ++++++++-------- content/docs/command-reference/remote/remove.md | 2 +- content/docs/command-reference/remote/rename.md | 2 +- content/docs/command-reference/status.md | 2 +- content/docs/start/data-versioning.md | 2 +- content/docs/use-cases/data-registries.md | 2 +- .../use-cases/sharing-data-and-model-files.md | 4 ++-- 12 files changed, 34 insertions(+), 34 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 3ae8e04d35..6e0bfdcc91 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -205,7 +205,7 @@ to learn more about the state file (database) that is used for optimization. > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). 
```dvc -$ dvc remote add myremote s3://bucket/path +$ dvc remote add myremote s3://bucket/key $ dvc config core.remote myremote ``` diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md index 835eb18856..2b6ffd1dc6 100644 --- a/content/docs/command-reference/get-url.md +++ b/content/docs/command-reference/get-url.md @@ -36,8 +36,8 @@ DVC supports several types of (local or) remote locations (protocols): | Type | Description | `url` format | | ------- | -------------- | ------------------------------------------ | | `local` | Local path | `/path/to/local/data` | -| `s3` | Amazon S3 | `s3://mybucket/data` | -| `gs` | Google Storage | `gs://mybucket/data` | +| `s3` | Amazon S3 | `s3://bucket/key` | +| `gs` | Google Storage | `gs://bucket/data` | | `ssh` | SSH server | `ssh://user@example.com:/path/to/data` | | `hdfs` | HDFS to file\* | `hdfs://user@example.com/path/to/data.csv` | | `http` | HTTP to file\* | `https://example.com/path/to/data.csv` | @@ -91,7 +91,7 @@ This command will copy an S3 object into the current working directory with the same file name: ```dvc -$ dvc get-url s3://bucket/path +$ dvc get-url s3://bucket/key ``` By default, DVC expects that AWS CLI is already diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 5a066c09d9..008d2c78ac 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -58,13 +58,13 @@ DVC supports several types of (local or) remote locations (protocols): | Type | Description | `url` format | | -------- | --------------------------------------------------- | ------------------------------------------ | | `local` | Local path | `/path/to/local/data` | -| `s3` | Amazon S3 | `s3://mybucket/data` | -| `azure` | Microsoft Azure Blob Storage | `azure://my-container-name/path/to/data` | -| `gs` | Google Cloud Storage | `gs://mybucket/data` | +| `s3` | Amazon S3 | 
`s3://bucket/data` | +| `azure` | Microsoft Azure Blob Storage | `azure://container/path/to/data` | +| `gs` | Google Cloud Storage | `gs://bucket/data` | | `ssh` | SSH server | `ssh://user@example.com:/path/to/data` | | `hdfs` | HDFS to file (explanation below) | `hdfs://user@example.com/path/to/data.csv` | | `http` | HTTP to file with _strong ETag_ (explanation below) | `https://example.com/path/to/data.csv` | -| `remote` | Remote path (see explanation below) | `remote://myremote/path/to/data` | +| `remote` | Remote path (see explanation below) | `remote://remote-name/path/to/data` | > If you installed DVC via `pip` and plan to use cloud services as remote > storage, you might need to install these optional dependencies: `[s3]`, diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 2fb2e97e23..4d47de14ad 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -93,7 +93,7 @@ The following are the types of remote storage (protocols) supported: > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). ```dvc -$ dvc remote add -d s3remote url s3://my-bucket/my-key +$ dvc remote add -d s3remote url s3://mybucket/mykey ``` By default, DVC expects your AWS CLI is already @@ -133,7 +133,7 @@ must explicitly configure the `endpointurl`: For example: ```dvc -$ dvc remote add -d myremote s3://my-bucket/path/to/dir +$ dvc remote add -d myremote s3://mybucket/path/to/dir $ dvc remote modify myremote endpointurl \ https://object-storage.example.com ``` @@ -145,7 +145,7 @@ S3 remotes can also be configured entirely via environment variables: ```dvc $ export AWS_ACCESS_KEY_ID="" $ export AWS_SECRET_ACCESS_KEY="" -$ dvc remote add -d myremote s3://my-bucket/my/key +$ dvc remote add -d myremote s3://mybucket/my/key ``` For more information about the variables DVC supports, please visit @@ -414,7 +414,7 @@ region. 
> [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). ```dvc -$ dvc remote add -d myremote s3://mybucket/myproject +$ dvc remote add -d myremote s3://mybucket/mykey Setting 'myremote' as a default remote. $ dvc remote modify myremote region us-east-2 @@ -424,7 +424,7 @@ The project's config file (`.dvc/config`) now looks like this: ```ini ['remote "myremote"'] -url = s3://mybucket/myproject +url = s3://mybucket/mykey region = us-east-2 [core] remote = myremote @@ -434,13 +434,13 @@ The list of remotes should now be: ```dvc $ dvc remote list -myremote s3://mybucket/myproject +myremote s3://mybucket/mykey ``` You can overwrite existing remotes using `-f` with `dvc remote add`: ```dvc -$ dvc remote add -f myremote s3://mybucket/mynewproject +$ dvc remote add -f myremote s3://mybucket/another-key ``` List remotes again to view the updated remote: @@ -448,5 +448,5 @@ List remotes again to view the updated remote: ```dvc $ dvc remote list -myremote s3://mybucket/mynewproject +myremote s3://mybucket/another-key ``` diff --git a/content/docs/command-reference/remote/index.md b/content/docs/command-reference/remote/index.md index c1d516d9db..9c39ea1078 100644 --- a/content/docs/command-reference/remote/index.md +++ b/content/docs/command-reference/remote/index.md @@ -103,7 +103,7 @@ remote = myremote > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). 
```dvc -$ dvc remote add newremote s3://mybucket/myproject +$ dvc remote add newremote s3://mybucket/mykey $ dvc remote modify newremote endpointurl https://object-storage.example.com ``` @@ -115,7 +115,7 @@ url = /path/to/remote [core] remote = myremote ['remote "newremote"'] -url = s3://mybucket/myproject +url = s3://mybucket/mykey endpointurl = https://object-storage.example.com ``` @@ -124,7 +124,7 @@ endpointurl = https://object-storage.example.com ```dvc $ dvc remote list myremote /path/to/remote -newremote s3://mybucket/myproject +newremote s3://mybucket/mykey ``` ## Example: Change the name of a remote diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index bb8bd75744..21471ab7e4 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -66,7 +66,7 @@ The following config options are available for all remote types: below): ```dvc - $ dvc remote modify s3remote url s3://my-bucket/my/key + $ dvc remote modify s3remote url s3://mybucket/mykey ``` Or a _local remote_ (a directory in the file system): @@ -105,7 +105,7 @@ these settings, you could use the following options. 
- `url` - remote location, in the `s3:///` format: ```dvc - $ dvc remote modify myremote url s3://my-bucket/my/key + $ dvc remote modify myremote url s3://mybucket/my/key ``` - `region` - change S3 remote region: @@ -240,7 +240,7 @@ To communicate with a remote object storage that supports an S3 compatible API must explicitly configure the `endpointurl`: ```dvc -$ dvc remote add -d myremote s3://my-bucket/path/to/dir +$ dvc remote add -d myremote s3://mybucket/path/to/dir $ dvc remote modify myremote endpointurl \ https://object-storage.example.com ``` @@ -250,7 +250,7 @@ S3 remotes can also be configured entirely via environment variables: ```dvc $ export AWS_ACCESS_KEY_ID='' $ export AWS_SECRET_ACCESS_KEY='' -$ dvc remote add -d myremote s3://my-bucket/my/key +$ dvc remote add -d myremote s3://mybucket/my/key ``` For more information about the variables DVC supports, please visit @@ -712,22 +712,22 @@ Let's first set up a _default_ S3 remote. > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). ```dvc -$ dvc remote add -d myremote s3://mybucket/myproject +$ dvc remote add -d myremote s3://mybucket/mykey Setting 'myremote' as a default remote. ``` Modify its access profile: ```dvc -$ dvc remote modify myremote profile myusername +$ dvc remote modify myremote profile myuser ``` Now the project config file should look like this: ```ini ['remote "myremote"'] -url = s3://mybucket/storage -profile = myusername +url = s3://mybucket/mykey +profile = myuser [core] remote = myremote ``` diff --git a/content/docs/command-reference/remote/remove.md b/content/docs/command-reference/remote/remove.md index c1939aa9cf..176f6bd4ad 100644 --- a/content/docs/command-reference/remote/remove.md +++ b/content/docs/command-reference/remote/remove.md @@ -46,7 +46,7 @@ The `name` argument is required. 
Add Amazon S3 remote: ```dvc -$ dvc remote add myremote s3://mybucket/myproject +$ dvc remote add myremote s3://mybucket/mykey ``` Remove it: diff --git a/content/docs/command-reference/remote/rename.md b/content/docs/command-reference/remote/rename.md index 7504492f5b..040336a8ec 100644 --- a/content/docs/command-reference/remote/rename.md +++ b/content/docs/command-reference/remote/rename.md @@ -50,7 +50,7 @@ DVC remote, respectively. Add Amazon S3 remote: ```dvc -$ dvc remote add myremote s3://mybucket/myproject +$ dvc remote add myremote s3://mybucket/mykey ``` Rename it: diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 969ba884aa..868b346fe8 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -225,7 +225,7 @@ what files we have generated but haven't pushed to the remote yet: ```dvc $ dvc remote list -storage s3://dvc-remote +storage s3://bucket/key ``` And would like to check what files we have generated but haven't pushed to the diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-versioning.md index 5ad583b364..9fc967a106 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-versioning.md @@ -85,7 +85,7 @@ retrieved on other environments later with `dvc pull`. 
First, we need to setup a storage: ```dvc -$ dvc remote add -d storage s3://my-bucket/dvc-storage +$ dvc remote add -d storage s3://mybucket/dvc-storage $ git commit .dvc/config -m "Configure remote storage" ``` diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index 3b014c7dab..44170ae2fe 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -82,7 +82,7 @@ The actual data is stored in the project's cache and can be be accessed from other locations or by other people: ```dvc -$ dvc remote add -d myremote s3://bucket/path +$ dvc remote add -d myremote s3://bucket/key $ dvc push ``` diff --git a/content/docs/use-cases/sharing-data-and-model-files.md b/content/docs/use-cases/sharing-data-and-model-files.md index 7b46bd91af..c78956de39 100644 --- a/content/docs/use-cases/sharing-data-and-model-files.md +++ b/content/docs/use-cases/sharing-data-and-model-files.md @@ -31,7 +31,7 @@ to the bucket where the data should be stored to the `dvc remote add` command. For example: ```dvc -$ dvc remote add -d myremote s3://mybucket/myproject +$ dvc remote add -d myremote s3://mybucket/mykey Setting 'myremote' as a default remote. 
``` @@ -43,7 +43,7 @@ remote section for it: ```dvc ['remote "myremote"'] -url = s3://mybucket/myproject +url = s3://mybucket/mykey [core] remote = myremote ``` From e864a408263b5893a5c99e0bb18a0d1fb89dc543 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 12:36:16 -0500 Subject: [PATCH 05/15] reorder external data examples to match remote refs --- .../docs/user-guide/external-dependencies.md | 36 +++++----- .../docs/user-guide/managing-external-data.md | 70 +++++++++---------- 2 files changed, 53 insertions(+), 53 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index f9ee3eda93..7b7397c1d9 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -41,24 +41,6 @@ stage to your list of stages in dvc.yaml. > Note that some of these commands use the `/home/shared` directory, typical in > Linux distributions. -### Local file system path - -```dvc -$ dvc run -n download_file - -d /home/shared/data.txt \ - -o data.txt \ - cp /home/shared/data.txt data.txt -``` - -### SSH - -```dvc -$ dvc run -n download_file - -d ssh://user@example.com:/home/shared/data.txt \ - -o data.txt \ - scp user@example.com:/home/shared/data.txt data.txt -``` - ### Amazon S3 ```dvc @@ -90,6 +72,15 @@ $ dvc run -n download_file gsutil cp gs://mybucket/data.txt data.txt ``` +### SSH + +```dvc +$ dvc run -n download_file + -d ssh://user@example.com:/home/shared/data.txt \ + -o data.txt \ + scp user@example.com:/home/shared/data.txt data.txt +``` + ### HDFS ```dvc @@ -112,6 +103,15 @@ $ dvc run -n download_file wget https://example.com/data.txt -O data.txt ``` +### Local file system path + +```dvc +$ dvc run -n download_file + -d /home/shared/data.txt \ + -o data.txt \ + cp /home/shared/data.txt data.txt +``` + ## Example: DVC remote aliases If instead of a URL you'd like to use an alias that can be managed diff --git 
a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 4449b620c0..3ccbfd7038 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -46,41 +46,6 @@ For the examples, let's take a look at a [stage](/doc/command-reference/run) that simply moves local file to an external location, producing a `data.txt.dvc` DVC-file. -### Local file system path - -The default local cache location is `.dvc/cache`, so there is no need to specify -it explicitly. - -```dvc -# Add data on an external location directly -$ dvc add --external /home/shared/mydata - -# Create the stage with an external location output -$ dvc run -d data.txt \ - --external \ - -o /home/shared/data.txt \ - cp data.txt /home/shared/data.txt -``` - -### SSH - -```dvc -# Add SSH remote to be used as cache location for SSH files -$ dvc remote add sshcache ssh://user@example.com:/cache - -# Tell DVC to use the 'sshcache' remote as SSH cache location -$ dvc config cache.ssh sshcache - -# Add data on SSH directly -$ dvc add --external ssh://user@example.com:/mydata - -# Create the stage with an external SSH output -$ dvc run -d data.txt \ - --external \ - -o ssh://user@example.com:/home/shared/data.txt \ - scp data.txt user@example.com:/home/shared/data.txt -``` - ### Amazon S3 ```dvc @@ -119,6 +84,25 @@ $ dvc run -d data.txt \ gsutil cp data.txt gs://mybucket/data.txt ``` +### SSH + +```dvc +# Add SSH remote to be used as cache location for SSH files +$ dvc remote add sshcache ssh://user@example.com:/cache + +# Tell DVC to use the 'sshcache' remote as SSH cache location +$ dvc config cache.ssh sshcache + +# Add data on SSH directly +$ dvc add --external ssh://user@example.com:/mydata + +# Create the stage with an external SSH output +$ dvc run -d data.txt \ + --external \ + -o ssh://user@example.com:/home/shared/data.txt \ + scp data.txt user@example.com:/home/shared/data.txt +``` + ### HDFS ```dvc 
@@ -142,3 +126,19 @@ $ dvc run -d data.txt \ Note that as long as there is a `hdfs://...` path for your data, DVC can handle it. So systems like Hadoop, Hive, and HBase are supported! + +### Local file system path + +The default local cache location is `.dvc/cache`, so there is no need to specify +it explicitly. + +```dvc +# Add data on an external location directly +$ dvc add --external /home/shared/mydata + +# Create the stage with an external location output +$ dvc run -d data.txt \ + --external \ + -o /home/shared/data.txt \ + cp data.txt /home/shared/data.txt +``` From db15a96dd86614ac0ba3636ae96fab9a19a43063 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 13:36:44 -0500 Subject: [PATCH 06/15] cmd: update remote location (URL) lists in get/import-url et al. per https://github.com/iterative/dvc.org/pull/1695#pullrequestreview-466995549 and https://github.com/iterative/dvc.org/pull/1695#pullrequestreview-466995884 --- content/docs/command-reference/get-url.md | 57 +++++++++++-------- content/docs/command-reference/import-url.md | 28 ++++----- content/docs/command-reference/remote/add.md | 4 +- .../docs/command-reference/remote/modify.md | 2 +- .../docs/user-guide/external-dependencies.md | 4 +- 5 files changed, 54 insertions(+), 41 deletions(-) diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md index 2b6ffd1dc6..d13e158cd8 100644 --- a/content/docs/command-reference/get-url.md +++ b/content/docs/command-reference/get-url.md @@ -33,14 +33,18 @@ directory will be placed inside. 
DVC supports several types of (local or) remote locations (protocols): -| Type | Description | `url` format | -| ------- | -------------- | ------------------------------------------ | -| `local` | Local path | `/path/to/local/data` | -| `s3` | Amazon S3 | `s3://bucket/key` | -| `gs` | Google Storage | `gs://bucket/data` | -| `ssh` | SSH server | `ssh://user@example.com:/path/to/data` | -| `hdfs` | HDFS to file\* | `hdfs://user@example.com/path/to/data.csv` | -| `http` | HTTP to file\* | `https://example.com/path/to/data.csv` | +| Type | Description | `url` format example | +| -------- | ---------------------------- | ------------------------------------------------------ | +| `s3` | Amazon S3 | `s3://bucket/key/to/data` | +| `azure` | Microsoft Azure Blob Storage | `azure://container/path/to/data` | +| `gdrive` | Google Drive | `gdrive:///data` | +| `gs` | Google Cloud Storage | `gs://bucket/path/to/data` | +| `ssh` | SSH server | `ssh://user@example.com:/path/to/data` | +| `hdfs` | HDFS to file\* | `hdfs://user@example.com/path/to/data.csv` | +| `http` | HTTP to file\* | `https://example.com/path/to/data.csv` | +| `webdav` | WebDav to file\* | `webdavs://example.com/public.php/webdav/path/to/data` | +| `local` | Local path | `/path/to/local/data` | +| `remote` | Remote path\* | `remote://remote-name/path/to/data` | > If you installed DVC via `pip` and plan to use cloud services as remote > storage, you might need to install these optional dependencies: `[s3]`, @@ -48,8 +52,15 @@ DVC supports several types of (local or) remote locations (protocols): > include them all. The command should look like this: `pip install "dvc[s3]"`. > (This example installs `boto3` library along with DVC to support S3 storage.) -\* HDFS and HTTP **do not** support downloading entire directories, only single -files. +\* Notes on remote locations: + +- HDFS, HTTP, and WebDav **do not** support downloading entire directories, only + single files. 
 + +- `remote://myremote/path/to/file` notation just means that a DVC + [remote](/doc/command-reference/remote) `myremote` is defined, and when DVC is + running, it automatically expands this URL into a regular S3, SSH, GS, etc. + URL by appending `/path/to/file` to `myremote`'s configured base path. Another way to understand the `dvc get-url` command is as a tool for downloading data files. On GNU/Linux systems for example, instead of `dvc get-url` with @@ -72,19 +83,6 @@ $ wget https://example.com/path/to/data.csv
-### Click and expand for a local example - -```dvc -$ dvc get-url /local/path/to/data -``` - -The above command will copy the `/local/path/to/data` file or directory into -`./dir`. - -
- -
- ### Click for Amazon S3 example This command will copy an S3 object into the current working directory with the @@ -156,3 +154,16 @@ $ dvc get-url https://example.com/path/to/file ```
+ +### Click and expand for a local example + +```dvc +$ dvc get-url /local/path/to/data +``` + +The above command will copy the `/local/path/to/data` file or directory into +`./dir`. + + + +
diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 008d2c78ac..7e37d902a4 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -55,16 +55,18 @@ source. DVC supports several types of (local or) remote locations (protocols): -| Type | Description | `url` format | -| -------- | --------------------------------------------------- | ------------------------------------------ | -| `local` | Local path | `/path/to/local/data` | -| `s3` | Amazon S3 | `s3://bucket/data` | -| `azure` | Microsoft Azure Blob Storage | `azure://container/path/to/data` | -| `gs` | Google Cloud Storage | `gs://bucket/data` | -| `ssh` | SSH server | `ssh://user@example.com:/path/to/data` | -| `hdfs` | HDFS to file (explanation below) | `hdfs://user@example.com/path/to/data.csv` | -| `http` | HTTP to file with _strong ETag_ (explanation below) | `https://example.com/path/to/data.csv` | -| `remote` | Remote path (see explanation below) | `remote://remote-name/path/to/data` | +| Type | Description | `url` format example | +| -------- | --------------------------------- | ------------------------------------------------------ | +| `s3` | Amazon S3 | `s3://bucket/key/to/data` | +| `azure` | Microsoft Azure Blob Storage | `azure://container/path/to/data` | +| `gdrive` | Google Drive | `gdrive:///data` | +| `gs` | Google Cloud Storage | `gs://bucket/path/to/data` | +| `ssh` | SSH server | `ssh://user@example.com:/path/to/data` | +| `hdfs` | HDFS to file\* | `hdfs://user@example.com/path/to/data.csv` | +| `http` | HTTP to file with _strong ETag_\* | `https://example.com/path/to/data.csv` | +| `webdav` | WebDav to file\* | `webdavs://example.com/public.php/webdav/path/to/data` | +| `local` | Local path | `/path/to/local/data` | +| `remote` | Remote path\* | `remote://remote-name/path/to/data` | > If you installed DVC via `pip` and plan to use cloud services as remote > storage, you might 
need to install these optional dependencies: `[s3]`, @@ -72,10 +74,10 @@ DVC supports several types of (local or) remote locations (protocols): > include them all. The command should look like this: `pip install "dvc[s3]"`. > (This example installs `boto3` library along with DVC to support S3 storage.) -Specific explanations: +\* Notes on remote locations: -- HDFS and HTTP **do not** support downloading entire directories, only single - files. +- HDFS, HTTP, and WebDav **do not** support downloading entire directories, only + single files. - In case of HTTP, [strong ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 4d47de14ad..be649827d7 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -158,7 +158,7 @@ For more information about the variables DVC supports, please visit ### Click for Microsoft Azure Blob Storage ```dvc -$ dvc remote add -d myremote azure://my-container-name/path +$ dvc remote add -d myremote azure://mycontainer/path $ dvc remote modify --local myremote connection_string \ 'my-connection-string' ``` @@ -172,7 +172,7 @@ variables: ```dvc $ export AZURE_STORAGE_CONNECTION_STRING='' -$ export AZURE_STORAGE_CONTAINER_NAME='my-container-name' +$ export AZURE_STORAGE_CONTAINER_NAME='mycontainer' $ dvc remote add -d myremote 'azure://' ``` diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 21471ab7e4..19604647d4 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -265,7 +265,7 @@ For more information about the variables DVC supports, please visit - `url` - remote location, in the `azure:///` format: ```dvc - $ dvc remote modify myremote url azure://my-container-name/path + $ dvc remote modify myremote url azure://mycontainer/path ``` 
- `connection_string` - connection string: diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 7b7397c1d9..01082f8da6 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -54,12 +54,12 @@ $ dvc run -n download_file ```dvc $ dvc run -n download_file - -d azure://my-container-name/data.txt \ + -d azure://mycontainer/data.txt \ -o data.txt \ az storage copy \ -d data.json \ --source-account-name my-account \ - --source-container my-container-name \ + --source-container mycontainer \ --source-blob data.txt ``` From 64b2a9cafb593e8c2dddb8c39799e2f5b8453a5b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Aug 2020 20:12:01 -0500 Subject: [PATCH 07/15] guide: add notes about external mounts for "local remote" locations --- content/docs/user-guide/external-dependencies.md | 5 +++++ content/docs/user-guide/managing-external-data.md | 10 ++++++++-- 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 01082f8da6..5bf4bbb96f 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -105,6 +105,11 @@ $ dvc run -n download_file ### Local file system path +For local paths outside of your project: + +> This includes different storage devices or partitions mounted on the same file +> system, e.g. `/mnt/raid/data`. + ```dvc $ dvc run -n download_file -d /home/shared/data.txt \ diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 3ccbfd7038..4d99f42f62 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -129,8 +129,14 @@ it. So systems like Hadoop, Hive, and HBase are supported! 
### Local file system path -The default local cache location is `.dvc/cache`, so there is no need to specify -it explicitly. +The default cache location is `.dvc/cache`, so there is no need to move it for +local paths outside of your project. + +> Except for external data on different storage devices or partitions mounted on +> the same file system (e.g. `/mnt/raid/data`). In that case please setup an +> external cache in that same drive to enable +> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +> and avoid copying data. ```dvc # Add data on an external location directly From 4619506c5f83870eff98cc8f3de6e02744341f14 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 18:24:56 -0500 Subject: [PATCH 08/15] s3: key -> path per https://github.com/iterative/dvc.org/pull/1695#pullrequestreview-467253289 --- content/docs/command-reference/config.md | 2 +- content/docs/command-reference/get-url.md | 2 +- content/docs/command-reference/remote/add.md | 2 +- content/docs/command-reference/remote/modify.md | 4 ++-- content/docs/command-reference/status.md | 2 +- content/docs/use-cases/data-registries.md | 2 +- 6 files changed, 7 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index b06317f609..6ae25fcb95 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -205,7 +205,7 @@ to learn more about the state file (database) that is used for optimization. > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). 
```dvc -$ dvc remote add myremote s3://bucket/key +$ dvc remote add myremote s3://bucket/path $ dvc config core.remote myremote ``` diff --git a/content/docs/command-reference/get-url.md b/content/docs/command-reference/get-url.md index b95572082a..e4ee535110 100644 --- a/content/docs/command-reference/get-url.md +++ b/content/docs/command-reference/get-url.md @@ -88,7 +88,7 @@ This command will copy an S3 object into the current working directory with the same file name: ```dvc -$ dvc get-url s3://bucket/key +$ dvc get-url s3://bucket/path ``` By default, DVC expects that AWS CLI is already diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index b028c1842b..8937fd8820 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -145,7 +145,7 @@ S3 remotes can also be configured entirely via environment variables: ```dvc $ export AWS_ACCESS_KEY_ID="" $ export AWS_SECRET_ACCESS_KEY="" -$ dvc remote add -d myremote s3://mybucket/my/key +$ dvc remote add -d myremote s3://mybucket/my/path ``` For more information about the variables DVC supports, please visit diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index d8d2e9f220..2c952c377c 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -105,7 +105,7 @@ these settings, you could use the following options. 
- `url` - remote location, in the `s3:///` format: ```dvc - $ dvc remote modify myremote url s3://mybucket/my/key + $ dvc remote modify myremote url s3://mybucket/my/path ``` - `region` - change S3 remote region: @@ -250,7 +250,7 @@ S3 remotes can also be configured entirely via environment variables: ```dvc $ export AWS_ACCESS_KEY_ID='' $ export AWS_SECRET_ACCESS_KEY='' -$ dvc remote add -d myremote s3://mybucket/my/key +$ dvc remote add -d myremote s3://mybucket/my/path ``` For more information about the variables DVC supports, please visit diff --git a/content/docs/command-reference/status.md b/content/docs/command-reference/status.md index 4fff05ac88..a31e00195c 100644 --- a/content/docs/command-reference/status.md +++ b/content/docs/command-reference/status.md @@ -227,7 +227,7 @@ what files we have generated but haven't pushed to the remote yet: ```dvc $ dvc remote list -storage s3://bucket/key +storage s3://bucket/path ``` And would like to check what files we have generated but haven't pushed to the diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index 44170ae2fe..3b014c7dab 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -82,7 +82,7 @@ The actual data is stored in the project's cache and can be be accessed from other locations or by other people: ```dvc -$ dvc remote add -d myremote s3://bucket/key +$ dvc remote add -d myremote s3://bucket/path $ dvc push ``` From fe810b16284288590ccc5ac2e34eca001916324c Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 20:05:38 -0500 Subject: [PATCH 09/15] s3: *key -> path in sample URLs per https://github.com/iterative/dvc.org/pull/1695#pullrequestreview-467253289 --- content/docs/command-reference/remote/add.md | 12 ++++++------ content/docs/command-reference/remote/index.md | 6 +++--- content/docs/command-reference/remote/modify.md | 6 +++--- content/docs/command-reference/remote/remove.md | 2 
+- content/docs/command-reference/remote/rename.md | 2 +- .../docs/use-cases/sharing-data-and-model-files.md | 4 ++-- 6 files changed, 16 insertions(+), 16 deletions(-) diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 8937fd8820..64c8087664 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -93,7 +93,7 @@ The following are the types of remote storage (protocols) supported: > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). ```dvc -$ dvc remote add -d s3remote url s3://mybucket/mykey +$ dvc remote add -d s3remote url s3://mybucket/path ``` By default, DVC expects your AWS CLI is already @@ -409,7 +409,7 @@ region. > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). ```dvc -$ dvc remote add -d myremote s3://mybucket/mykey +$ dvc remote add -d myremote s3://mybucket/path Setting 'myremote' as a default remote. 
$ dvc remote modify myremote region us-east-2 @@ -419,7 +419,7 @@ The project's config file (`.dvc/config`) now looks like this: ```ini ['remote "myremote"'] -url = s3://mybucket/mykey +url = s3://mybucket/path region = us-east-2 [core] remote = myremote @@ -429,13 +429,13 @@ The list of remotes should now be: ```dvc $ dvc remote list -myremote s3://mybucket/mykey +myremote s3://mybucket/path ``` You can overwrite existing remotes using `-f` with `dvc remote add`: ```dvc -$ dvc remote add -f myremote s3://mybucket/another-key +$ dvc remote add -f myremote s3://mybucket/another-path ``` List remotes again to view the updated remote: @@ -443,5 +443,5 @@ List remotes again to view the updated remote: ```dvc $ dvc remote list -myremote s3://mybucket/another-key +myremote s3://mybucket/another-path ``` diff --git a/content/docs/command-reference/remote/index.md b/content/docs/command-reference/remote/index.md index 9c39ea1078..6aa54dee17 100644 --- a/content/docs/command-reference/remote/index.md +++ b/content/docs/command-reference/remote/index.md @@ -103,7 +103,7 @@ remote = myremote > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). 
```dvc -$ dvc remote add newremote s3://mybucket/mykey +$ dvc remote add newremote s3://mybucket/path $ dvc remote modify newremote endpointurl https://object-storage.example.com ``` @@ -115,7 +115,7 @@ url = /path/to/remote [core] remote = myremote ['remote "newremote"'] -url = s3://mybucket/mykey +url = s3://mybucket/path endpointurl = https://object-storage.example.com ``` @@ -124,7 +124,7 @@ endpointurl = https://object-storage.example.com ```dvc $ dvc remote list myremote /path/to/remote -newremote s3://mybucket/mykey +newremote s3://mybucket/path ``` ## Example: Change the name of a remote diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 2c952c377c..ba006f6506 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -66,7 +66,7 @@ The following config options are available for all remote types: below): ```dvc - $ dvc remote modify s3remote url s3://mybucket/mykey + $ dvc remote modify s3remote url s3://mybucket/path ``` Or a _local remote_ (a directory in the file system): @@ -717,7 +717,7 @@ Let's first set up a _default_ S3 remote. > [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). ```dvc -$ dvc remote add -d myremote s3://mybucket/mykey +$ dvc remote add -d myremote s3://mybucket/path Setting 'myremote' as a default remote. ``` @@ -731,7 +731,7 @@ Now the project config file should look like this: ```ini ['remote "myremote"'] -url = s3://mybucket/mykey +url = s3://mybucket/path profile = myuser [core] remote = myremote diff --git a/content/docs/command-reference/remote/remove.md b/content/docs/command-reference/remote/remove.md index 176f6bd4ad..12bfa6eb36 100644 --- a/content/docs/command-reference/remote/remove.md +++ b/content/docs/command-reference/remote/remove.md @@ -46,7 +46,7 @@ The `name` argument is required. 
Add Amazon S3 remote: ```dvc -$ dvc remote add myremote s3://mybucket/mykey +$ dvc remote add myremote s3://mybucket/path ``` Remove it: diff --git a/content/docs/command-reference/remote/rename.md b/content/docs/command-reference/remote/rename.md index 040336a8ec..06f61c8aac 100644 --- a/content/docs/command-reference/remote/rename.md +++ b/content/docs/command-reference/remote/rename.md @@ -50,7 +50,7 @@ DVC remote, respectively. Add Amazon S3 remote: ```dvc -$ dvc remote add myremote s3://mybucket/mykey +$ dvc remote add myremote s3://mybucket/path ``` Rename it: diff --git a/content/docs/use-cases/sharing-data-and-model-files.md b/content/docs/use-cases/sharing-data-and-model-files.md index c78956de39..e2930f470b 100644 --- a/content/docs/use-cases/sharing-data-and-model-files.md +++ b/content/docs/use-cases/sharing-data-and-model-files.md @@ -31,7 +31,7 @@ to the bucket where the data should be stored to the `dvc remote add` command. For example: ```dvc -$ dvc remote add -d myremote s3://mybucket/mykey +$ dvc remote add -d myremote s3://mybucket/path Setting 'myremote' as a default remote. ``` @@ -43,7 +43,7 @@ remote section for it: ```dvc ['remote "myremote"'] -url = s3://mybucket/mykey +url = s3://mybucket/path [core] remote = myremote ``` From eaef0919063a67417c2c96218200271d4109db9e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 20:13:46 -0500 Subject: [PATCH 10/15] cmd: roll back unnecessary change in remote modify --- content/docs/command-reference/remote/modify.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index ba006f6506..3de966a489 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -724,7 +724,7 @@ Setting 'myremote' as a default remote. 
Modify its access profile: ```dvc -$ dvc remote modify myremote profile myuser +$ dvc remote modify myremote profile myusername ``` Now the project config file should look like this: @@ -732,7 +732,7 @@ Now the project config file should look like this: ```ini ['remote "myremote"'] url = s3://mybucket/path -profile = myuser +profile = myusername [core] remote = myremote ``` From aa87054f51efd843d3cfbe3fa69f49e8efb26a05 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 20:15:22 -0500 Subject: [PATCH 11/15] start: roll back link to CML for now... --- content/docs/start/data-pipelines.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index d264153562..efb7da34e4 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -301,8 +301,7 @@ important problems: Storing these files in Git makes it easy to version and share. - _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing projects in way that it can be reproduced (built) is the fist necessary step - before introducing CI/CD systems. See our sister project, - [CML](https://cml.dev/) for some examples. + before introducing CI/CD systems. ## Visualize From ee0f42a31e61dfe67e26196c705be3895ee7c5cf Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 20:18:26 -0500 Subject: [PATCH 12/15] start: roll back S3 URL in Data Versioning --- content/docs/start/data-versioning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-versioning.md index 5dde610ca2..d328165545 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-versioning.md @@ -86,7 +86,7 @@ retrieved on other environments later with `dvc pull`. 
First, we need to setup a storage: ```dvc -$ dvc remote add -d storage s3://mybucket/dvc-storage +$ dvc remote add -d storage s3://my-bucket/dvc-storage $ git commit .dvc/config -m "Configure remote storage" ``` From 500ac1431bfe03116d388b14de6e31cac45a63b6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 20:25:32 -0500 Subject: [PATCH 13/15] guide: roll back changes to DVC metafiled for now... --- .../user-guide/dvc-files-and-directories.md | 33 ++++++++++--------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/content/docs/user-guide/dvc-files-and-directories.md b/content/docs/user-guide/dvc-files-and-directories.md index 2347c3301e..d27979d068 100644 --- a/content/docs/user-guide/dvc-files-and-directories.md +++ b/content/docs/user-guide/dvc-files-and-directories.md @@ -5,24 +5,26 @@ directory (`.dvc/`) with the [internal directories and files](#internal-directories-and-files) needed for DVC operation. -Additionally, there are a few special kinds of files that support DVC's -features: +Additionally, there are a few special kind of files created by certain +[DVC commands](/doc/command-reference): - Files ending with the `.dvc` extension are placeholders to track data files - and directories. A DVC project usually has one `.dvc` file per - large data file or dataset directory being tracked. -- `dvc.yaml` files (or _pipelines files_) specify stages that form the - pipeline(s) of a project, and how they connect (_dependency graph_ or DAG). + and directories. A DVC project usually has one + [`.dvc` file](#dvc-files) per large data file or dataset directory being + tracked. +- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages + that form the pipeline(s) of a project, and their connections (_dependency + graph_ or DAG). - These typically have a matching `dvc.lock` file to record the pipeline state - and track its data artifacts. 
+ These typically come with a matching `dvc.lock` file to record the pipeline + state and track its data artifacts. Both `.dvc` files and `dvc.yaml` use human-friendly YAML schemas, described below. We encourage you to get familiar with them so you may create, generate, and edit them on your own. -Both the internal directory and these special files should be versioned with Git -(in Git-enabled repositories). +All these should be versioned with Git (in Git-enabled +repositories). ## .dvc files @@ -148,9 +150,8 @@ the possible following fields: (the file's location). - `deps`: List of dependency file or directory paths of this stage (relative to `wdir` which defaults to the file's location) -- `params`: List of [parameter dependencies](/doc/command-reference/params). - These are key paths referring to a YAML, JSON or TOML file (`params.yaml` by - default). +- `params`: List of parameter dependency keys (field names) that + are read from a YAML, JSON, or TOML file (`params.yaml` by default). - `outs`: List of output file or directory paths of this stage (relative to `wdir` which defaults to the file's location), and optionally, whether or not this file or directory is cached (`true` by @@ -218,15 +219,15 @@ stages: Stage commands are listed again in `dvc.lock`, in order to know when their definitions change in the `dvc.yaml` file. -Regular dependencies and all kinds of outputs +Regular dependencies and all types of outputs (including [metrics](/doc/command-reference/metrics) and [plots](/doc/command-reference/plots) files) are also listed (per stage) in `dvc.lock`, but with an additional field to store the hash value of each file or directory tracked by DVC. Specifically: `md5`, `etag`, or `checksum` (same as in `deps` and `outs` entries of [`.dvc` files](#dvc-files)). -[Parameter](/doc/command-reference/params#examples) key/value pairs are listed -separately under `params`, grouped by parameters file. 
+Full parameters (key and value) are listed separately under +`params`, grouped by parameters file. ## Internal directories and files From 1805191960064cbb56ead61872daf88a9b352582 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 20:30:03 -0500 Subject: [PATCH 14/15] guide: roll back unrelated note in x deps path (for now) --- content/docs/user-guide/external-dependencies.md | 5 ----- 1 file changed, 5 deletions(-) diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 4200d8959a..c23a486f12 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -110,11 +110,6 @@ $ dvc run -n download_file ### Local file system path -For local paths outside of your project: - -> This includes different storage devices or partitions mounted on the same file -> system, e.g. `/mnt/raid/data`. - ```dvc $ dvc run -n download_file -d /home/shared/data.txt \ From 2fe9b7523a73bcd28d5dce7e6694c7c6aed3b161 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 20 Aug 2020 21:33:17 -0500 Subject: [PATCH 15/15] guide: roll back unrelated changes to x outs page (for now) per https://github.com/iterative/dvc.org/pull/1695#pullrequestreview-467254617 --- content/docs/user-guide/managing-external-data.md | 10 ++-------- 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 49f49897ad..3d18e1a4f1 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -135,14 +135,8 @@ it. So systems like Hadoop, Hive, and HBase are supported! ### Local file system path -The default cache location is `.dvc/cache`, so there is no need to move it for -local paths outside of your project. 
- -> Except for external data on different storage devices or partitions mounted on -> the same file system (e.g. `/mnt/raid/data`). In that case please setup an -> external cache in that same drive to enable -> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -> and avoid copying data. +The default local cache location is `.dvc/cache`, so there is no need to specify +it explicitly. ```dvc # Add data on an external location directly