From 5f0da08d7635d195e15b7c81c98c7eda7633318c Mon Sep 17 00:00:00 2001 From: Ruslan Kuprieiev Date: Fri, 17 Jan 2020 06:23:44 +0200 Subject: [PATCH 01/27] add docs for dvc metrics diff --- .../docs/command-reference/metrics/diff.md | 106 ++++++++++++++++++ 1 file changed, 106 insertions(+) create mode 100644 public/static/docs/command-reference/metrics/diff.md diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md new file mode 100644 index 0000000000..c799270dab --- /dev/null +++ b/public/static/docs/command-reference/metrics/diff.md @@ -0,0 +1,106 @@ +# metrics diff + +Find and print [project metrics](/doc/command-reference/metrics) changes between +commits, commit and a working tree, etc. + +## Synopsis + +```usage +usage: dvc metrics diff [-h] [-q | -v] [--targets [TARGETS [TARGETS ...]]] + [-t TYPE] [-x XPATH] [-R] [--show-json] + [a_ref] [b_ref] + +positional arguments: + a_ref Git reference from which diff is calculated. If + omitted `HEAD`(latest commit) is used. + b_ref Git reference to which diff is calculated. If omitted + current working tree is used. +``` + +## Description + +Finds and prints changes between commits for all metrics in the +project by examining all of its +[DVC-files](/doc/user-guide/dvc-file-format). If `--targets` are provided, it +will show changes for those specific metric files instead. + +The optional `--targets` argument can contain several metric files. With the +`-R` option, a target can even be a directory, so that DVC recursively shows +changes for all metric files in it. + +Providing a `type` (`-t` option) overwrites the full metric specification (both +`type` and `xpath` fields) defined in the +[DVC-file](/doc/user-guide/dvc-file-format) (usually set originally with the +`dvc metrics modify` command). + +If `type` (via `-t`) is not specified and only `xpath` (`-x` option) is, only +the `xpath` field is overwritten in its DVC-file. 
(DVC will first try to read +`type` from the DVC-file, but it can be automatically detected by the file +extension.) + +> Alternatively, see `dvc metrics modify` command to learn how to apply `-t` and +> `-x` permanently. + +## Options + +- `--targets` - metric files or directories (see -R) to show changes for. If not + specified, will show changes for all metric files, if not specified. + +- `-t`, `--type` - specify a type of the metric file. Accepted values are: + `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine + appropriate parsing and displaying format for this metric file and will + override `type` defined in the corresponding DVC-file See `dvc metrics modify` + for a full description of acceptable types. If no `type` is specified either + as a CLI `-t|--type` nor in the corresponding DVC-file itself, + `dvc metrics diff` will try to detect it on-the-fly. + +- `-x`, `--xpath` - specify a path within a metric file to get a specific metric + value to show changes for. If ommited, will show changes for all possible + paths. It will override `xpath` defined in the correspodning DVC-file. See + `dvc metrics modify` for a full description of `xpath` when applied to + specific metric types. + + If multiple metric files exist in the project, the same parser + and path will be applied to all of them. If `xpath` for a particular metric + has been set using `dvc metrics modify`, the path passed with this option will + overwrite it for the current command run only – It may fail to produce any + results or parse files that are not in a corresponding format in this case. + +- `-R`, `--recursive` - `path` is expected to be a directory for this option to + have effect. Determines the metric files to show changes for by searching each + target directory and its subdirectories for DVC-files to inspect. + +- `--show-json` - prints diff in easilly parsable JSON format instead of + human-readable table. 
+ +- `-h`, `--help` - prints the usage/help message, and exit. + +- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no + problems arise, otherwise 1. + +- `-v`, `--verbose` - displays detailed tracing information. + +## Examples + +Let's create a metrics file using a dummy command and commit it to git: + +``` +$ dvc run -M metrics.json 'echo "{\"AUC\": 0.5}" > metrics.json' +$ git commit -a -m "add metrics" +``` + +Now let's say we've adjusted our scripts and our AUC has changed: + +``` +$ dvc run -M metrics.json 'echo "{\"AUC\": 0.6}" > metrics.json' +``` + +To see the change, let's run `dvc metrics diff` without arguments, that would +compare our current metrics to what we've had in the last commit (similar to +`git diff`): + +``` +$ dvc metrics diff + Path Metric Value Change +metrics.json AUC 0.600 0.100 +``` From 9da0661033d13b3c4a607db68f9053f398608dc0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 24 Jan 2020 16:54:48 -0600 Subject: [PATCH 02/27] nav: add `metrics diff` to sidebar --- public/static/docs/command-reference/metrics/diff.md | 4 ++-- public/static/docs/sidebar.json | 4 ++++ 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index c799270dab..98af6abe14 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -1,7 +1,7 @@ # metrics diff -Find and print [project metrics](/doc/command-reference/metrics) changes between -commits, commit and a working tree, etc. +Find and print [project metrics](/doc/command-reference/metrics#description) +changes between commits, commit and a working tree, etc. 
## Synopsis diff --git a/public/static/docs/sidebar.json b/public/static/docs/sidebar.json index 37d204841e..699dbcc6cc 100644 --- a/public/static/docs/sidebar.json +++ b/public/static/docs/sidebar.json @@ -253,6 +253,10 @@ "label": "metrics modify", "slug": "modify" }, + { + "label": "metrics diff", + "slug": "diff" + }, { "label": "metrics remove", "slug": "remove" From 4a1b77565cd791e712c1f31b7d72600c557a85b9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 26 Jan 2020 23:21:45 -0600 Subject: [PATCH 03/27] cmd ref: typos in `metrics diff` --- public/static/docs/command-reference/metrics/diff.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 98af6abe14..f39d7f5e04 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -55,8 +55,8 @@ extension.) `dvc metrics diff` will try to detect it on-the-fly. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric - value to show changes for. If ommited, will show changes for all possible - paths. It will override `xpath` defined in the correspodning DVC-file. See + value to show changes for. If omitted, will show changes for all possible + paths. It will override `xpath` defined in the corresponding DVC-file. See `dvc metrics modify` for a full description of `xpath` when applied to specific metric types. @@ -70,7 +70,7 @@ extension.) have effect. Determines the metric files to show changes for by searching each target directory and its subdirectories for DVC-files to inspect. -- `--show-json` - prints diff in easilly parsable JSON format instead of +- `--show-json` - prints diff in easily parsable JSON format instead of human-readable table. - `-h`, `--help` - prints the usage/help message, and exit. 
From 8ae2e6d839ab059e2b275039ab00695d1cd870bd Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 27 Jan 2020 11:24:39 -0600 Subject: [PATCH 04/27] cmd ref: rewrite `metrics diff` ref and and review related concepts throughout docs e.g. "Git reference", "working tree" --- public/static/docs/command-reference/diff.md | 41 +++++++------- public/static/docs/command-reference/get.md | 5 +- .../static/docs/command-reference/import.md | 9 ++- .../docs/command-reference/metrics/diff.md | 55 ++++++++++--------- .../static/docs/command-reference/update.md | 8 +-- .../static/docs/get-started/older-versions.md | 2 +- public/static/docs/glossary.js | 4 ++ public/static/docs/install/pre-release.md | 2 +- public/static/docs/tutorials/versioning.md | 8 +-- .../versioning-data-and-model-files.md | 10 ++-- .../static/docs/user-guide/dvc-file-format.md | 4 +- 11 files changed, 76 insertions(+), 72 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index 724b6c59b8..18f5ba276e 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -12,17 +12,18 @@ narrowed down to specific target files and directories under DVC control. usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref] positional arguments: - a_ref Git reference from which diff calculates - b_ref Git reference until which diff calculates, if - omitted diff shows the difference between - current HEAD and a_ref + a_ref Git reference from which diff calculates + b_ref Git reference until which diff calculates, if omitted + `HEAD` (latest commit) is used. ``` ## Description -Given two Git commit references (commit hash, branch or tag name, etc) `a_ref` -and `b_ref`, this command shows a a summary of basic statistics: how many files -were deleted/changed, and the file size differences. 
+Given two +[Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +(commit hash, branch or tag name, etc.) `a_ref` and `b_ref`, this command shows +a a summary of basic statistics: how many files were deleted/changed, and the +file size differences. `a_ref` is required, while `b_ref` defaults to `HEAD`. Note that `dvc diff` does not show the line-to-line comparison among the target files in each revision, like `git diff` does. @@ -31,17 +32,13 @@ files in each revision, like `git diff` does. > [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256) > in our GitHub repository. -If the `-t` option is used, the diff is limited to the `TARGET` file or -directory specified. - -Note that `dvc diff` does not have an effect when the repository is not tracked -by the Git SCM, for example when `dvc init` was used with the `--no-scm` option. +`dvc diff` does not have an effect when the repository is not tracked by Git, +for example when `dvc init` was used with the `--no-scm` option. ## Options -- `-t TARGET`, `--target TARGET` - path to a data file or directory. If not - specified, compares all files and directories that are under DVC control in - the workspace. +- `-t TARGET`, `--target TARGET` - path to a data file or directory to limit + diff for. - `-h`, `--help` - prints the usage/help message, and exit. @@ -83,11 +80,12 @@ Preparing to download data from 'https://remote.dvc.org/get-started' ## Example: Previous version of the same branch -The minimal `dvc diff` command only includes the A reference (`a_ref`) from -which the difference is to be calculated. The B reference (`b_ref`) defaults to -Git `HEAD` (the currently checked out version). To find the general differences -with the very previous committed version of the project, we can use the `HEAD^` -Git reference. +The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from +which to calculate the difference. 
The "until" reference (`b_ref`) defaults to +`HEAD` (currently checked out Git version). + +To find the general differences with the very previous version of the project, +we can use `HEAD^` as reference A: ```dvc $ dvc diff HEAD^ @@ -143,7 +141,8 @@ diff for 'model.pkl' ``` The output from this command confirms that there's a difference in the -`model.pkl` file between the 2 Git references we indicated. +`model.pkl` file between the 2 Git references (tags `baseline-experiment` and +`bigrams-experiment`) we indicated. ### What about directories? diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 05c654065f..5e2e71d09b 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -58,9 +58,8 @@ name. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - `url` is expected to represent a Git repository for this option to - have an effect. Specific - [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +- `--rev` - Specific + [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (such as a branch name, a tag, or a commit hash) of the repository to download the file or directory from. The tip of the default branch is used by default when this option is not specified. diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 19c09a2806..c2fd3d3462 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -75,9 +75,8 @@ data artifact from the source project. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - `url` is expected to represent a Git repository for this option to - have an effect. 
Specific - [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +- `--rev` - Specific + [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (such as a branch name, a tag, or a commit hash) of the repository to download the file or directory from. The tip of the default branch is used by default when this option is not specified. @@ -161,9 +160,9 @@ deps: ``` If the -[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) moves (e.g. a branch), you may use `dvc update` to bring the data up to date. -However, for typically static references (e.g. tags), or for SHA commits, in +However, for typically static references (e.g. tags), or for commits hashes, in order to actually "update" an import, it's necessary to **re-import the data** instead, by using `dvc import` again without or with a different `--rev`. This will overwrite the import stage (DVC-file), either removing or replacing the diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index f39d7f5e04..9e080f4fe5 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -1,45 +1,48 @@ # metrics diff -Find and print [project metrics](/doc/command-reference/metrics#description) -changes between commits, commit and a working tree, etc. +Find and print changes in [metrics](/doc/command-reference/metrics#description) +between project versions. ## Synopsis ```usage -usage: dvc metrics diff [-h] [-q | -v] [--targets [TARGETS [TARGETS ...]]] +usage: dvc metrics diff [-h] [-q | -v] + [--targets [TARGETS [TARGETS ...]]] [-t TYPE] [-x XPATH] [-R] [--show-json] [a_ref] [b_ref] positional arguments: - a_ref Git reference from which diff is calculated. If - omitted `HEAD`(latest commit) is used. - b_ref Git reference to which diff is calculated. 
If omitted - current working tree is used. + a_ref Git reference from which diff is calculated. If + omitted, `HEAD` (latest commit) is used. + b_ref Git reference to which diff is calculated. If omitted, + the current workspace is used instead. ``` ## Description -Finds and prints changes between commits for all metrics in the +Calculates the numeric difference (delta) between a metric's value in two +different +[Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +(such as a branch name, a tag, or a commit hash) for all metrics in the project by examining all of its -[DVC-files](/doc/user-guide/dvc-file-format). If `--targets` are provided, it -will show changes for those specific metric files instead. +[DVC-files](/doc/user-guide/dvc-file-format). -The optional `--targets` argument can contain several metric files. With the -`-R` option, a target can even be a directory, so that DVC recursively shows -changes for all metric files in it. +Note that `a_ref` and `b_ref` have different defaults than those in `dvc diff`, +and omitting `b_ref` causes the current workspace metrics (included +uncommitted local changes) to be used, instead of a Git reference. -Providing a `type` (`-t` option) overwrites the full metric specification (both -`type` and `xpath` fields) defined in the -[DVC-file](/doc/user-guide/dvc-file-format) (usually set originally with the -`dvc metrics modify` command). +If `--targets` are provided, it will show changes for those specific metric +files instead. With the `-R` option, a target can even be a directory, so that +DVC recursively shows changes for all metric files in it. -If `type` (via `-t`) is not specified and only `xpath` (`-x` option) is, only -the `xpath` field is overwritten in its DVC-file. 
(DVC will first try to read -`type` from the DVC-file, but it can be automatically detected by the file +Providing a type of metric (`-t` option) overwrites the full metric +specification (both `type` and `xpath` fields) defined in the +[DVC-file](/doc/user-guide/dvc-file-format). If only the `--xpath` (`-x`) option +is used, just the `xpath` field is overwritten. (DVC will first try to read +`type` from the DVC-file, or it can be automatically detected by the file extension.) -> Alternatively, see `dvc metrics modify` command to learn how to apply `-t` and -> `-x` permanently. +> See `dvc metrics modify` to learn how to apply `-t` and `-x` permanently. ## Options @@ -82,7 +85,7 @@ extension.) ## Examples -Let's create a metrics file using a dummy command and commit it to git: +Let's create a metrics file using a dummy command and commit it with Git: ``` $ dvc run -M metrics.json 'echo "{\"AUC\": 0.5}" > metrics.json' @@ -95,9 +98,9 @@ Now let's say we've adjusted our scripts and our AUC has changed: $ dvc run -M metrics.json 'echo "{\"AUC\": 0.6}" > metrics.json' ``` -To see the change, let's run `dvc metrics diff` without arguments, that would -compare our current metrics to what we've had in the last commit (similar to -`git diff`): +To see the change, let's run `dvc metrics diff` without arguments. This compares +our current workspace metrics to what we had in the previous +commit: ``` $ dvc metrics diff diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index 93df4f7949..2bc68a818a 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -30,10 +30,10 @@ update them. Another detail to note is that when the `--rev` (revision) option of `dvc import` has been used to create an import stage, DVC is not aware of what kind of -[Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this -is, for example a branch or a tag. 
For typically static references (e.g. tags), -or for SHA commits, `dvc update` will not have any effect on the import. Refer -to the +[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +this is, for example a branch or a tag. For typically static references (e.g. +tags), or for commits hashes, `dvc update` will not have any effect on the +import. Refer to the [re-importing example](/doc/command-reference/import#example-fixed-revisions-re-importing) to learn how to "update" fixed-revision imports. diff --git a/public/static/docs/get-started/older-versions.md b/public/static/docs/get-started/older-versions.md index 5371cc59af..4fa976e442 100644 --- a/public/static/docs/get-started/older-versions.md +++ b/public/static/docs/get-started/older-versions.md @@ -17,7 +17,7 @@ $ dvc checkout train.dvc ``` These two commands will bring the previous model file to its place in the -working tree. +workspace.
diff --git a/public/static/docs/glossary.js b/public/static/docs/glossary.js index 40d252c180..38297685a2 100644 --- a/public/static/docs/glossary.js +++ b/public/static/docs/glossary.js @@ -13,6 +13,10 @@ code, ML models, etc. A workspace becomes a **DVC project** when [\`dvc init\`](/doc/command-reference/init) is run, and [DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it. +Includes the +[working tree](https://git-scm.com/docs/gitglossary#def_working_tree) (\`HEAD\` +plus local changes) for Git repositories. + Note that [external outputs](/doc/user-guide/managing-external-data) also form part of your expanded workspace, technically. ` diff --git a/public/static/docs/install/pre-release.md b/public/static/docs/install/pre-release.md index 8a1ea93e4c..1f31b1353a 100644 --- a/public/static/docs/install/pre-release.md +++ b/public/static/docs/install/pre-release.md @@ -15,7 +15,7 @@ $ pip install git+https://github.com/iterative/dvc ``` > `gitpython` allows the installation process to generate a DVC version using -> the current Git commit SHA. This lets us to distinguish official DVC releases +> the current Git commit hash. This lets us to distinguish official DVC releases > (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). For more > information on our versioning convention, refer to > [Components of DVC version](/doc/command-reference/version#components-of-dvc-version). diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index 9279d31e14..fb9d52ef2a 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -263,10 +263,10 @@ $ git checkout v1.0 $ dvc checkout ``` -These commands will restore the working tree to the first snapshot we made: -code, data files, model, all of it. DVC optimizes this operation to avoid -copying data or model files each time. 
So `dvc checkout` is quick even if you -have large datasets, data files, or models. +These commands will restore the workspace to the first snapshot we made: code, +data files, model, all of it. DVC optimizes this operation to avoid copying data +or model files each time. So `dvc checkout` is quick even if you have large +datasets, data files, or models. On the other hand, if we want to keep the current version of the code and go back to the previous dataset only, we can do something like this (make sure that diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index 567891a14d..f9a19fa8da 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -83,17 +83,17 @@ There are two ways to get to the previous version of the dataset or model: a full workspace checkout, or checkout of a specific data or model file. Let's consider the full checkout first. It's quite straightforward: -> `v1.0` is a Git tag that should be created in advance to identify the dataset -> version you are interested in. Any Git reference (for example `HEAD^` or a -> commit hash) can be used instead. +> `v1.0` below is a Git tag that should be created in advance to identify the +> dataset version you are interested in. Any Git version (for example `HEAD^` or +> a commit hash) can be used instead. ```dvc $ git checkout v1.0 $ dvc checkout ``` -These commands will restore the working tree to the first snapshot we made - -code, dataset and model files all matching each other. DVC can +These commands will restore the workspace to the first snapshot we made - code, +dataset and model files all matching each other. DVC can [optimize](/doc/user-guide/large-dataset-optimization) this operation to avoid copying files each time, so `dvc checkout` is quick even if you have large dataset or model files. 
diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index 9c4542d5e1..7458248dfe 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -68,8 +68,8 @@ A dependency entry consists of a pair of fields: - `url`: URL of Git repository with source DVC project - `rev`: Only present when the `--rev` option of `dvc import` is used. Specific - [Git revision](https://git-scm.com/book/en/v2/Git-Internals-Git-References) - used to import the dependency from. + [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) + (such as a branch name or a tag) used to import the dependency from. - `rev_lock`: Revision or version (Git commit hash) of the external DVC repository at the time of importing or updating (with `dvc update`) the dependency. From c056834c4ff25520063582524c4aa50671448e22 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 27 Jan 2020 13:28:14 -0600 Subject: [PATCH 05/27] cmd ref: update descs, review options, link all metrics subcmds addresses https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348849914 as well as https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348847997 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348858027 --- .../static/docs/command-reference/checkout.md | 6 +- .../docs/command-reference/metrics/add.md | 24 +++--- .../docs/command-reference/metrics/diff.md | 83 ++++++++----------- .../docs/command-reference/metrics/index.md | 1 + .../docs/command-reference/metrics/modify.md | 20 ++--- .../docs/command-reference/metrics/show.md | 44 +++++----- .../static/docs/command-reference/status.md | 10 +-- 7 files changed, 86 insertions(+), 102 deletions(-) diff --git a/public/static/docs/command-reference/checkout.md b/public/static/docs/command-reference/checkout.md index 5045082b5b..af024f859c 100644 --- a/public/static/docs/command-reference/checkout.md 
+++ b/public/static/docs/command-reference/checkout.md @@ -39,9 +39,9 @@ The execution of `dvc checkout` does the following: DVC-file, are restored from the cache. See options `--force` and `--relink`. -By default, this command tries not to copy files between the cache and the -workspace, using reflinks instead, when supported by the file system. (Refer to -[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).) +By default, this command tries not make copies of cached files in the workspace, +using reflinks instead when supported by the file system (refer to +[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)). The next linking strategy default value is `copy` though, so unless other file link types are manually configured in `cache.type` (using `dvc config`), files will be copied. Keep in mind that having file copies doesn't present much of a diff --git a/public/static/docs/command-reference/metrics/add.md b/public/static/docs/command-reference/metrics/add.md index d610af862c..8b571b5a3e 100644 --- a/public/static/docs/command-reference/metrics/add.md +++ b/public/static/docs/command-reference/metrics/add.md @@ -20,8 +20,8 @@ defines the given `path` as an output, marking `path` as a Note that outputs can also be marked as metrics via the `-m` or `-M` options of the `dvc run` command. -While any text file can be tracked as a metric file, we recommend using `TSV`, -`CSV`, or `JSON` formats. DVC provides a way to parse those formats to get to a +While any text file can be tracked as a metric file, we recommend using TSV, +CSV, or JSON formats. DVC provides a way to parse those formats to get to a specific value, if the file contains multiple metrics. See `dvc metrics show` for more details. @@ -30,22 +30,22 @@ for more details. ## Options -- `-t`, `--type` - specify a type of the metric file. Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. 
It will be saved into the +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the corresponding DVC-file, and used by `dvc metrics show` to determine how to handle displaying metrics. - `raw` is the default when no type is provided. It means that no additional - parsing is applied, and `--xpath` is ignored. `htsv`/`hcsv` are the same as - `tsv`/`csv`, but the values in the first row of the file will be used as the - field names and should be used to address columns in the `--xpath` option. + `raw` means that no additional parsing is applied, and `--xpath` is ignored. + `htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of + the file will be used as the field names and should be used to address columns + in the `--xpath` option. - `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if the metric file contains multiple numbers and you - need to get a only one of them. Only a single path is allowed. It will be - saved into the corresponding DVC-file, and used by `dvc metrics show` to - determine how to handle displaying metrics. The accepted value depends on the - metric file type (`--type` option): + want to use only one of them. Only a single path is allowed. It will be saved + into the corresponding DVC-file, and used by `dvc metrics show` to determine + how to display metrics. The accepted value depends on the metric file type + (`--type` option): - For `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. 
diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 9e080f4fe5..59b73631b7 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -1,7 +1,8 @@ # metrics diff -Find and print changes in [metrics](/doc/command-reference/metrics#description) -between project versions. +Show a table of changes between +[metrics](/doc/command-reference/metrics#description) among +repository versions. ## Synopsis @@ -20,61 +21,45 @@ positional arguments: ## Description -Calculates the numeric difference (delta) between a metric's value in two +The changes shown by this command includes the new value, and numeric difference +(delta) from the previous value of metrics. They're calculated between two different [Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -(such as a branch name, a tag, or a commit hash) for all metrics in the -project by examining all of its -[DVC-files](/doc/user-guide/dvc-file-format). +(such as branch names, tags, or commit SHA hashes) for all metrics in the +project, found by examining all of the +[DVC-files](/doc/user-guide/dvc-file-format) in both versions. -Note that `a_ref` and `b_ref` have different defaults than those in `dvc diff`, -and omitting `b_ref` causes the current workspace metrics (included -uncommitted local changes) to be used, instead of a Git reference. +The metrics to use in this command can be limited with the `--targets` option. +target can also be directories (with the `-R` option), so that DVC recursively +shows changes for all metric files in it. -If `--targets` are provided, it will show changes for those specific metric -files instead. With the `-R` option, a target can even be a directory, so that -DVC recursively shows changes for all metric files in it. 
+## Options -Providing a type of metric (`-t` option) overwrites the full metric -specification (both `type` and `xpath` fields) defined in the -[DVC-file](/doc/user-guide/dvc-file-format). If only the `--xpath` (`-x`) option -is used, just the `xpath` field is overwritten. (DVC will first try to read -`type` from the DVC-file, or it can be automatically detected by the file -extension.) +- `--targets` - specific metric files or directories to calculate metrics + differences for. If omitted (default), this command use all metric files found + in both Git references. -> See `dvc metrics modify` to learn how to apply `-t` and `-x` permanently. +- `-R`, `--recursive` - determines the metric files to use by searching each + target directory and its subdirectories for DVC-files to inspect. `targets` is + expected to contain one or more directories for this option to have effect. -## Options +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine + how to parse and format metics for display. See `dvc metrics modify` for more + details. + + This option will override `type` and `xpath` defined in the corresponding + DVC-file. If no `type` is provided or found in the DVC-file, DVC will try to + detect it based on file extension. + +- `-x`, `--xpath` - specify a path within a metric file to show changes for a + specific metric value only. Should be used if the metric file contains + multiple numbers and you want to use only one of them. Only a single path is + allowed. It will override `xpath` defined in the corresponding DVC-file. See + `dvc metrics modify` for more details. -- `--targets` - metric files or directories (see -R) to show changes for. If not - specified, will show changes for all metric files, if not specified. - -- `-t`, `--type` - specify a type of the metric file. Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. 
It will be used to determine - appropriate parsing and displaying format for this metric file and will - override `type` defined in the corresponding DVC-file See `dvc metrics modify` - for a full description of acceptable types. If no `type` is specified either - as a CLI `-t|--type` nor in the corresponding DVC-file itself, - `dvc metrics diff` will try to detect it on-the-fly. - -- `-x`, `--xpath` - specify a path within a metric file to get a specific metric - value to show changes for. If omitted, will show changes for all possible - paths. It will override `xpath` defined in the corresponding DVC-file. See - `dvc metrics modify` for a full description of `xpath` when applied to - specific metric types. - - If multiple metric files exist in the project, the same parser - and path will be applied to all of them. If `xpath` for a particular metric - has been set using `dvc metrics modify`, the path passed with this option will - overwrite it for the current command run only – It may fail to produce any - results or parse files that are not in a corresponding format in this case. - -- `-R`, `--recursive` - `path` is expected to be a directory for this option to - have effect. Determines the metric files to show changes for by searching each - target directory and its subdirectories for DVC-files to inspect. - -- `--show-json` - prints diff in easily parsable JSON format instead of - human-readable table. +- `--show-json` - prints the command's output in easily parsable JSON format, + instead of a human-readable table. - `-h`, `--help` - prints the usage/help message, and exit. 
diff --git a/public/static/docs/command-reference/metrics/index.md b/public/static/docs/command-reference/metrics/index.md index 538b474421..ebdd850361 100644 --- a/public/static/docs/command-reference/metrics/index.md +++ b/public/static/docs/command-reference/metrics/index.md @@ -3,6 +3,7 @@ A set of commands to collect and display project metrics: [add](/doc/command-reference/metrics/add), [show](/doc/command-reference/metrics/show), +[diff](/doc/command-reference/metrics/diff), [modify](/doc/command-reference/metrics/modify), and [remove](/doc/command-reference/metrics/remove). diff --git a/public/static/docs/command-reference/metrics/modify.md b/public/static/docs/command-reference/metrics/modify.md index bfe6297881..dcd5fb657d 100644 --- a/public/static/docs/command-reference/metrics/modify.md +++ b/public/static/docs/command-reference/metrics/modify.md @@ -33,22 +33,22 @@ ERROR: failed to modify metric file settings - ## Options -- `-t`, `--type` - specify a type of the metric file. Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the corresponding DVC-file, and used by `dvc metrics show` to determine how to handle displaying metrics. - `raw` is the default when no type is provided. It means that no additional - parsing is applied, and `--xpath` is ignored. `htsv`/`hcsv` are the same as - `tsv`/`csv`, but the values in the first row of the file will be used as the - field names and should be used to address columns in the `--xpath` option. + `raw` means that no additional parsing is applied, and `--xpath` is ignored. + `htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of + the file will be used as the field names and should be used to address columns + in the `--xpath` option. 
- `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if the metric file contains multiple numbers and you - need to get a only one of them. Only a single path is allowed. It will be - saved into the corresponding DVC-file, and used by `dvc metrics show` to - determine how to handle displaying metrics. The accepted value depends on the - metric file type (`--type` option): + want to use only one of them. Only a single path is allowed. It will be saved + into the corresponding DVC-file, and used by `dvc metrics show` to determine + how to display metrics. The accepted value depends on the metric file type + (`--type` option): - For `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. diff --git a/public/static/docs/command-reference/metrics/show.md b/public/static/docs/command-reference/metrics/show.md index b41606b45f..a8d0a45fac 100644 --- a/public/static/docs/command-reference/metrics/show.md +++ b/public/static/docs/command-reference/metrics/show.md @@ -26,43 +26,41 @@ The optional `targets` argument can contain several metric files. With the `-R` option, a target can even be a directory, so that DVC recursively shows all metric files in it. -Providing a `type` (`-t` option) overwrites the full metric specification (both +Providing a `type` (`-t` option) overrides the full metric specification (both `type` and `xpath` fields) defined in the -[DVC-file](/doc/user-guide/dvc-file-format) (usually set originally with the -`dvc metrics modify` command). +[DVC-file](/doc/user-guide/dvc-file-format) (originally with +`dvc metrics modify`, typically). If `type` (via `-t`) is not specified and only `xpath` (`-x` option) is, only -the `xpath` field is overwritten in its DVC-file. (DVC will first try to read +the `xpath` field from the DVC-file is overridden. 
(DVC will first try to read `type` from the DVC-file, but it can be automatically detected by the file extension.) -> Alternatively, see `dvc metrics modify` command to learn how to apply `-t` and -> `-x` permanently. +> See `dvc metrics modify` to learn how to apply `-t` and `-x` permanently. + +See also `dvc metrics diff` to show changes in metrics between different +repository versions. ## Options -- `-t`, `--type` - specify a type of the metric file. Accepted values are: - `raw`, `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be saved into the - corresponding DVC-file, and used by `dvc metrics show` to determine how to - handle displaying metrics. +- `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` + (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine + how to parse and format metrics for display. - `raw` is the default when no type is provided. It means that no additional - parsing is applied, and `--xpath` is ignored. `htsv`/`hcsv` are the same as - `tsv`/`csv`, but the values in the first row of the file will be used as the - field names and should be used to address columns in the `--xpath` option. + `raw` means that no additional parsing is applied, and `--xpath` is ignored. + `htsv`/`hcsv` are the same as `tsv`/`csv`, but the values in the first row of + the file will be used as the field names and should be used to address columns + in the `--xpath` option. - This option along with `--xpath` below takes precedence over the `type` and - `xpath` specified in the corresponding DVC file. If this parameter is not - given, the type can be detected by the file extension automatically if the - type is supported. If any other value is specified, it is ignored and - defaulted back to `raw`. + This option will override `type` and `xpath` defined in the corresponding + DVC-file. If no `type` is provided or found in the DVC-file, DVC will try to + detect it based on file extension. 
- `-x`, `--xpath` - specify a path within a metric file to get a specific metric value. Should be used if the metric file contains multiple numbers and you - need to get a only one of them. Only a single path is allowed. It will be - saved into the corresponding DVC-file, and used by `dvc metrics show` to - determine how to handle displaying metrics. The accepted value depends on the - metric file type (`--type` option): + want to use only one of them. Only a single path is allowed. It will override + `xpath` defined in the corresponding DVC-file. The accepted value depends on + the metric file type (`--type` option): - For `json` - see [JSONPath spec](https://goessner.net/articles/JsonPath/) or [jsonpath-ng](https://github.com/h2non/jsonpath-ng) for available options. diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 559d149d21..36baff33de 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -19,11 +19,11 @@ positional arguments: ## Description `dvc status` searches for changes in the existing pipelines, either showing -which [stages](/doc/command-reference/run) have changed in the workspace and -must be reproduced (with `dvc repro`), or differences between cache vs. remote -storage (meaning `dvc push` or `dvc pull` should be run to synchronize them). -The two modes, _local_ and _cloud_ are triggered by using the `--cloud` or -`--remote` options: +which [stages](/doc/command-reference/run) have changed in the workspace +(including uncommitted local changes) and must be reproduced (with `dvc repro`), +or differences between cache vs. remote storage (meaning `dvc push` or +`dvc pull` should be run to synchronize them). 
The two modes, _local_ and +_cloud_ are triggered by using the `--cloud` or `--remote` options: | Mode | CLI Option | Description | | ------ | ---------- | --------------------------------------------------------------------------------------------------------------------------- | From 344839a145fd7ddaabb8cbf8caa1c96af8f25e4e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 27 Jan 2020 13:43:53 -0600 Subject: [PATCH 06/27] cmd ref: update cmd argument descriptions for `diff` and `metrics diff` --- public/static/docs/command-reference/diff.md | 8 ++++---- public/static/docs/command-reference/metrics/diff.md | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index 18f5ba276e..568f3aeda0 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -12,8 +12,8 @@ narrowed down to specific target files and directories under DVC control. usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref] positional arguments: - a_ref Git reference from which diff calculates - b_ref Git reference until which diff calculates, if omitted + a_ref Git reference from which the diff begins + b_ref Git reference until which the diff ends. If omitted, `HEAD` (latest commit) is used. ``` @@ -22,8 +22,8 @@ positional arguments: Given two [Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (commit hash, branch or tag name, etc.) `a_ref` and `b_ref`, this command shows -a a summary of basic statistics: how many files were deleted/changed, and the -file size differences. `a_ref` is required, while `b_ref` defaults to `HEAD`. +a summary of basic statistics: how many files were deleted/changed, and the file +size differences. `a_ref` is required, while `b_ref` defaults to `HEAD`. Note that `dvc diff` does not show the line-to-line comparison among the target files in each revision, like `git diff` does. 
diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 59b73631b7..4079d76008 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -13,9 +13,9 @@ usage: dvc metrics diff [-h] [-q | -v] [a_ref] [b_ref] positional arguments: - a_ref Git reference from which diff is calculated. If - omitted, `HEAD` (latest commit) is used. - b_ref Git reference to which diff is calculated. If omitted, + a_ref Git reference from which the diff begins. If omitted, + `HEAD` (latest commit) is used. + b_ref Git reference until which the diff ends. If omitted, the current workspace is used instead. ``` From 5a08d37f940054a5553b264893d5085175e9f427 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 00:04:49 -0600 Subject: [PATCH 07/27] metrics diff: big terminology review around the intro of this new command per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348855380 et al. 
--- public/static/docs/command-reference/add.md | 4 +- public/static/docs/command-reference/diff.md | 32 +++++++------- public/static/docs/command-reference/fetch.md | 2 +- public/static/docs/command-reference/gc.md | 9 ++-- public/static/docs/command-reference/get.md | 24 +++++------ .../docs/command-reference/import-url.md | 4 +- .../static/docs/command-reference/import.md | 13 +++--- .../static/docs/command-reference/install.md | 14 +++---- .../docs/command-reference/metrics/diff.md | 15 ++++--- .../docs/command-reference/metrics/show.md | 2 +- public/static/docs/command-reference/pull.md | 4 +- public/static/docs/command-reference/push.md | 6 +-- .../docs/command-reference/remote/modify.md | 2 +- .../static/docs/command-reference/status.md | 42 +++++++++---------- public/static/docs/get-started/experiments.md | 2 +- public/static/docs/get-started/import-data.md | 2 +- public/static/docs/tutorials/pipelines.md | 2 +- public/static/docs/tutorials/versioning.md | 17 ++++---- .../understanding-dvc/collaboration-issues.md | 6 +-- .../docs/understanding-dvc/what-is-dvc.md | 10 +++-- .../static/docs/use-cases/data-registries.md | 5 ++- .../versioning-data-and-model-files.md | 8 ++-- public/static/docs/user-guide/analytics.md | 11 +++-- .../docs/user-guide/contributing/core.md | 2 +- .../docs/user-guide/contributing/docs.md | 4 +- .../static/docs/user-guide/dvc-file-format.md | 10 ++--- .../docs/user-guide/external-dependencies.md | 2 +- .../docs/user-guide/managing-external-data.md | 14 +++---- .../docs/user-guide/running-dvc-on-windows.md | 9 ++-- 29 files changed, 136 insertions(+), 141 deletions(-) diff --git a/public/static/docs/command-reference/add.md b/public/static/docs/command-reference/add.md index a50d6f0ddd..45989312a5 100644 --- a/public/static/docs/command-reference/add.md +++ b/public/static/docs/command-reference/add.md @@ -74,8 +74,8 @@ to work with directory hierarchies with `dvc add`: directory (with default name `dirname.dvc`). 
Every file in the hierarchy is added to the cache (unless `--no-commit` flag is added), but DVC does not produce individual DVC-files for each file in the directory tree. Instead, - the single DVC-file points to a file in the cache that contains references to - the files in the added hierarchy. + the single DVC-file references a file in the cache that in turn points to the + files in the added hierarchy. In a DVC project, `dvc add` can be used to version control any data artifact (input, intermediate, or output files and diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index 568f3aeda0..7a14e8b189 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -1,10 +1,10 @@ # diff -Show changes between versions of the DVC project. It can be +Show changes between revisions of the DVC repository. It can be narrowed down to specific target files and directories under DVC control. -> This command requires that the project is a [Git](https://git-scm.com/) -> repository. +> This command requires that the project is a +> [Git](https://git-scm.com/) repository. ## Synopsis @@ -58,9 +58,9 @@ For these examples we can use the chapters in our ### Click and expand to setup example Start by cloning our example repo if you don't already have it. 
Then move into -the repo and checkout the -[version](https://github.com/iterative/example-get-started/releases/tag/3-add-file) -corresponding to the _Add Files_ chapter: +the repo and checkout +[the revision](https://github.com/iterative/example-get-started/releases/tag/3-add-file) +corresponding to the [Add Files](/doc/get-started/add-files) chapter: ```dvc $ git clone https://github.com/iterative/example-get-started @@ -80,12 +80,12 @@ Preparing to download data from 'https://remote.dvc.org/get-started' ## Example: Previous version of the same branch -The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from -which to calculate the difference. The "until" reference (`b_ref`) defaults to -`HEAD` (currently checked out Git version). +The minimal `dvc diff` command only includes the "from" revision (`a_ref`) from +which to calculate the difference. The "until" revision (`b_ref`) defaults to +`HEAD` (currently checked out Git revision). -To find the general differences with the very previous version of the project, -we can use `HEAD^` as reference A: +To find the general differences with the very previous revision of the project, +we can use `HEAD^` as `a_ref`: ```dvc $ dvc diff HEAD^ @@ -97,7 +97,7 @@ diff for 'data/data.xml' added file with size 37.9 MB ``` -## Example: Specific targets across Git references +## Example: Specific targets across Git revisions We can base this example in the [Metrics](/doc/get-started/metrics) and [Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get @@ -127,7 +127,7 @@ example repo.
-To see the difference in `model.pkl` among these versions, we can run the +To see the difference in `model.pkl` among these revisions, we can run the following command. ```dvc @@ -141,7 +141,7 @@ diff for 'model.pkl' ``` The output from this command confirms that there's a difference in the -`model.pkl` file between the 2 Git references (tags `baseline-experiment` and +`model.pkl` file between the 2 Git revisions (tags `baseline-experiment` and `bigrams-experiment`) we indicated. ### What about directories? @@ -190,6 +190,6 @@ diff for 'data/prepared' ``` The command above checks whether there have been any changes to the -`data/prepared` directory after the `5-preparation` version (since the `b_ref` -is the current version, `HEAD` by default). The output tells us that there have +`data/prepared` directory after the `5-preparation` revision (since the `b_ref` +is the current revision, `HEAD` by default). The output tells us that there have been no changes to that directory (or to any other file). diff --git a/public/static/docs/command-reference/fetch.md b/public/static/docs/command-reference/fetch.md index 32cf8d1f7c..8e8aeca7c1 100644 --- a/public/static/docs/command-reference/fetch.md +++ b/public/static/docs/command-reference/fetch.md @@ -154,7 +154,7 @@ solving the problem: $ git tag baseline-experiment <- first simple version of the model -bigrams-experiment <- use bigrams to improve the model +bigrams-experiment <- use bigrams to improve the model ``` ## Example: Default behavior diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index 4ba6cf11ab..ae986ad492 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -24,7 +24,7 @@ There are important things to note when using Git to version the - If the cache/remote holds several versions of the same data, all except the current one will be deleted. 
- Use the `--all-branches` or `--all-tags` options to avoid collecting data - referenced in the tips of all branches or in all tags, respectively. + referenced in the tips of all branches or all tags, respectively. Unless the `--cloud` (`-c`) option is used, `dvc gc` does not remove data files from any remote. This means that any files collected from the local cache can be @@ -38,10 +38,9 @@ restored using `dvc fetch`, as long as they have previously been uploaded with latest experiment revisions. Especially, if you intend to use `dvc gc -c` this option is much safer. -- `-T`, `--all-tags` - the same as `-a` above but keeps cache for existing Git - tags. It's useful if tags are used to track "checkpoints" of an experiment or - project. Note that both options can be combined, for example using the `-aT` - flag. +- `-T`, `--all-tags` - the same as `-a` above but applies to Git tags. It's + useful if tags are used to track "checkpoints" of an experiment or project. + Note that both options can be combined, for example using the `-aT` flag. - `-p`, `--projects` - if a single remote or a single cache is shared among different projects (e.g. a configuration like the one described diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 405f930b0e..4e89020b34 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -58,11 +58,10 @@ name. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - specific - [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) - (such as a branch name, a tag, or a commit hash) of the repository to download - the file or directory from. The tip of the default branch is used by default - when this option is not specified. 
+- `--rev` - specific [Git revision](https://git-scm.com/docs/revisions) (such as + a branch name, a tag, or a commit hash) of the repository to download the file + or directory from. The tip of the default branch is used by default when this + option is not specified. - `--show-url` - instead of downloading the file or directory, just print the storage location (URL) of the target data. `path` is expected to represent a @@ -140,11 +139,12 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 ## Example: Compare different versions of data or model -`dvc get` provides the `--rev` option to specify which version of the repository -to download a data artifact from. It also has the `--out` option to -specify the location to place the artifact within the workspace. Combining these -two options allows us to do something we can't achieve with the regular -`git checkout` + `dvc checkout` process – see for example the +`dvc get` provides the `--rev` option to specify which +[Git revision](https://git-scm.com/docs/revisions) of the repository to download +a data artifact from. It also has the `--out` option to specify the +location to place the artifact within the workspace. Combining these two options +allows us to do something we can't achieve with the regular `git checkout` + +`dvc checkout` process – see for example the [Get Older Data Version](/doc/get-started/older-versions) chapter of our _Get Started_. @@ -178,10 +178,10 @@ The `model.monograms.pkl` file now contains the older version of the model. To get the most recent one, we use a similar command, but with `-o model.bigrams.pkl` and `--rev 9-bigrams-model` or even without `--rev` -(since it's the latest version anyway). In fact, in this case using `dvc pull` +(since it's the latest revision anyway). In fact, in this case using `dvc pull` with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) should suffice, downloading the file as just `model.pkl`. 
We can then rename it to make -its version explicit: +its variant explicit: ```dvc $ dvc pull train.dvc diff --git a/public/static/docs/command-reference/import-url.md b/public/static/docs/command-reference/import-url.md index 47b808b21a..7f3a1d432b 100644 --- a/public/static/docs/command-reference/import-url.md +++ b/public/static/docs/command-reference/import-url.md @@ -135,8 +135,8 @@ Follow these instructions before each example below if you actually want to try them on your system. Start by cloning our example repo if you don't already have it. Then move into -the repo and checkout the -[version](https://github.com/iterative/example-get-started/releases/tag/2-remote) +the repo and checkout +[the revision](https://github.com/iterative/example-get-started/releases/tag/2-remote) corresponding to the [Configure](/doc/get-started/configure) chapter: ```dvc diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 160e3b9d9b..9c67fa6846 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -75,11 +75,10 @@ data artifact from the source project. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - specific - [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) - (such as a branch name, a tag, or a commit hash) of the repository to download - the file or directory from. The tip of the default branch is used by default - when this option is not specified. +- `--rev` - specific [Git revision](https://git-scm.com/docs/revisions) (such as + a branch name, a tag, or a commit hash) of the repository to download the file + or directory from. The tip of the default branch is used by default when this + option is not specified. > Note that this adds a `rev` field in the import stage that fixes it to this > revision. This can impact the behavior of `dvc update`. 
(See @@ -129,8 +128,8 @@ outs: Several of the values above are pulled from the original stage file `model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock` -subfields under `repo` are used to save the origin and version of the -dependency. +subfields under `repo` are used to save the origin and revision of the +dependency, respectively. ## Example: Fixed revisions & re-importing diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index ff2c9710a2..0f2e702cea 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -19,11 +19,11 @@ automatically. Namely: **Checkout**: For any given branch or tag, `git checkout` retrieves the -[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The -project's DVC-files in turn refer to data stored in -cache, but not necessarily in the workspace. Normally, -it would be necessary to run `dvc checkout` to synchronize workspace and -DVC-files. +[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that +[Git revision](https://git-scm.com/docs/revisions). The project's +DVC-files in turn refer to data stored in cache, but not +necessarily in the workspace. Normally, it would be necessary to +run `dvc checkout` to synchronize workspace and DVC-files. The installed Git hook automates running `dvc checkout`. @@ -121,8 +121,8 @@ $ dvc pull --all-branches --all-tags ## Example: Checkout both DVC and Git Let's start our exploration with the impact of `dvc install` on the -`dvc checkout` command. Remember that switching from one Git repository version -to another (with `git checkout`) changes the set of +`dvc checkout` command. Remember that switching from one Git revision to another +(with `git checkout`) changes the set of [DVC-files](/doc/user-guide/dvc-file-format) in the project. 
This changes the set of data files that should be located in the workspace (which can be achieved with `dvc checkout`). diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 4079d76008..2be0e408f1 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -1,8 +1,11 @@ # metrics diff Show a table of changes between -[metrics](/doc/command-reference/metrics#description) among -repository versions. +[metrics](/doc/command-reference/metrics#description) among DVC +repository revisions. + +> This command requires that the project is a +> [Git](https://git-scm.com/) repository. ## Synopsis @@ -25,9 +28,9 @@ The changes shown by this command includes the new value, and numeric difference (delta) from the previous value of metrics. They're calculated between two different [Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -(such as branch names, tags, or commit SHA hashes) for all metrics in the +(commit hash, branch or tag name, etc.) for all metrics in the project, found by examining all of the -[DVC-files](/doc/user-guide/dvc-file-format) in both versions. +[DVC-files](/doc/user-guide/dvc-file-format) in both revisions. The metrics to use in this command can be limited with the `--targets` option. target can also be directories (with the `-R` option), so that DVC recursively @@ -36,8 +39,8 @@ shows changes for all metric files in it. ## Options - `--targets` - specific metric files or directories to calculate metrics - differences for. If omitted (default), this command use all metric files found - in both Git references. + differences for. If omitted (default), this command uses all metric files + found in both Git revisions. - `-R`, `--recursive` - determines the metric files to use by searching each target directory and its subdirectories for DVC-files to inspect. 
`targets` is diff --git a/public/static/docs/command-reference/metrics/show.md b/public/static/docs/command-reference/metrics/show.md index a8d0a45fac..e29d6c0e22 100644 --- a/public/static/docs/command-reference/metrics/show.md +++ b/public/static/docs/command-reference/metrics/show.md @@ -39,7 +39,7 @@ extension.) > See `dvc metrics modify` to learn how to apply `-t` and `-x` permanently. See also `dvc metrics diff` to show changes in metrics between different -repository versions. +repository [revisions](https://git-scm.com/docs/revisions). ## Options diff --git a/public/static/docs/command-reference/pull.md b/public/static/docs/command-reference/pull.md index 3b97e111da..fcc2bc3891 100644 --- a/public/static/docs/command-reference/pull.md +++ b/public/static/docs/command-reference/pull.md @@ -40,8 +40,8 @@ With no arguments, just `dvc pull` or `dvc pull --remote REMOTE`, it downloads only the files (or directories) missing from the workspace by searching all [DVC-files](/doc/user-guide/dvc-file-format) currently in the project. It will not download files associated with earlier -versions or branches of the repository if using Git, nor will it download files -that have not changed. +[revisions](https://git-scm.com/docs/revisions) of the repository +(if using Git), nor will it download files that have not changed. The command `dvc status -c` can list files referenced in current DVC-files, but missing in the cache. It can be used to see what files `dvc pull` diff --git a/public/static/docs/command-reference/push.md b/public/static/docs/command-reference/push.md index 219c20bb98..9dc17c7b30 100644 --- a/public/static/docs/command-reference/push.md +++ b/public/static/docs/command-reference/push.md @@ -52,9 +52,9 @@ configure a remote. With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads only the files (or directories) that are new in the local repository to remote -storage. 
It will not upload files associated with earlier versions or branches -of the project directory, nor will it upload files that have not -changed. +storage. It will not upload files associated with earlier +[revisions](https://git-scm.com/docs/revisions) of the repository +(if using Git), nor will it upload files that have not changed. The `dvc status -c` command can list files tracked by DVC that are new in the cache (compared to the default remote.) It can be used to see what files diff --git a/public/static/docs/command-reference/remote/modify.md b/public/static/docs/command-reference/remote/modify.md index 416a780bfb..2b9a1b1344 100644 --- a/public/static/docs/command-reference/remote/modify.md +++ b/public/static/docs/command-reference/remote/modify.md @@ -182,7 +182,7 @@ these settings, you could use the following options: > identifiable by `id` (AWS Canonical User ID), `emailAddress` or `uri` > (predefined group). - > **References**: + > **Sources** > > - [ACL Overview - Permissions](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#permissions) > - [Put Object ACL](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectAcl.html) diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 36baff33de..922d017fe0 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -1,9 +1,9 @@ # status Show changes in the project -[pipelines](/doc/command-reference/pipeline), as well as mismatches either -between the cache and workspace files, or between the -cache and remote storage. +[pipelines](/doc/command-reference/pipeline), as well as file mismatches either +between the cache and workspace, or between the cache +and remote storage. 
## Synopsis @@ -21,8 +21,8 @@ positional arguments: `dvc status` searches for changes in the existing pipelines, either showing which [stages](/doc/command-reference/run) have changed in the workspace (including uncommitted local changes) and must be reproduced (with `dvc repro`), -or differences between cache vs. remote storage (meaning `dvc push` or -`dvc pull` should be run to synchronize them). The two modes, _local_ and +or differences between cache vs. remote storage (meaning `dvc push` +or `dvc pull` should be run to synchronize them). The two modes, _local_ and _cloud_ are triggered by using the `--cloud` or `--remote` options: | Mode | CLI Option | Description | @@ -32,12 +32,12 @@ _cloud_ are triggered by using the `--cloud` or `--remote` options: | remote | `--cloud` | Comparisons are made between the cache, and the default remote, defined with `dvc remote --default` command. | DVC determines data and code files to compare by analyzing all -[DVC-files](/doc/user-guide/dvc-file-format) in the project +[DVC-files](/doc/user-guide/dvc-file-format) in the repository (`--all-branches` and `--all-tags` in the `cloud` mode compare multiple -workspace versions). The comparison can be limited to specific DVC-files by -listing them as `targets`. Changes are reported only against the given -`targets`. When combined with the `--with-deps` option, a search is made for -changes in other stages that affect the target. +[Git revisions](https://git-scm.com/docs/revisions)). The comparison can be +limited to specific DVC-files by listing them as `targets`. Changes are reported +only against the given `targets`. When combined with the `--with-deps` option, a +search is made for changes in other stages that affect the target. 
In the `local` mode, changes are detected through the checksum of every file listed in every DVC-file in question against the corresponding file in the file @@ -53,12 +53,10 @@ This indicates that no differences were detected, and therefore no stages would be executed by `dvc repro`. If instead, differences are detected, `dvc status` lists those changes. For each -DVC-file (stage) with differences, the changes in _dependencies_ and/or -_outputs_ that differ are listed. For each item listed, either the file name or -the checksum is shown, and additionally a status word is shown describing the -changes (described below). This changes list provides a reference to both the -status of a DVC-file, as well as the changes to individual dependencies and -outputs described in it. +DVC-file (stage) with differences, the changes in dependencies +and/or outputs that differ are listed. For each item listed, either +the file name or the checksum is shown, and additionally a status word is shown +describing the changes (described below). - _changed checksum_ means that the DVC-file checksum has changed (e.g. someone manually edited the file). @@ -115,14 +113,14 @@ workspace) is different from remote storage. Bringing the two into sync requires name defined using the `dvc remote` command. Implies `--cloud`. - `-a`, `--all-branches` - compares cache content against all Git branches - instead of checking just the current workspace version. This basically runs - the same status command in all the branches of this repo. The corresponding - branches are shown in the status output. Applies only if `--cloud` or a `-r` - remote is specified. + instead of checking just the current revision. This basically runs the same + status command in all the branches of this repo. The corresponding branches + are shown in the status output. Applies only if `--cloud` or a `-r` remote is + specified. 
- `-T`, `--all-tags` - compares cache content against all Git tags instead of - checking just the current workspace version. Similar to `-a` above. Note that - both options can be combined, for example using the `-aT` flag. + checking just the current revision. Similar to `-a` above. Note that both + options can be combined, for example using the `-aT` flag. - `-j JOBS`, `--jobs JOBS` - specifies the number of jobs DVC can use to retrieve information from remote servers. This only applies when the `--cloud` diff --git a/public/static/docs/get-started/experiments.md b/public/static/docs/get-started/experiments.md index a8b852610c..1f7fb337b8 100644 --- a/public/static/docs/get-started/experiments.md +++ b/public/static/docs/get-started/experiments.md @@ -35,7 +35,7 @@ $ git commit -am "Reproduce model using bigrams" > for more details. Now, we have a new `model.pkl` captured and saved. To get back to the initial -version we run `git checkout` along with `dvc checkout` command: +version, we run `git checkout` along with `dvc checkout` command: ``` $ git checkout baseline-experiment diff --git a/public/static/docs/get-started/import-data.md b/public/static/docs/get-started/import-data.md index daa9cf79e2..fcaf65782e 100644 --- a/public/static/docs/get-started/import-data.md +++ b/public/static/docs/get-started/import-data.md @@ -68,7 +68,7 @@ outs: ``` The `url` and `rev_lock` subfields under `repo` are used to save the origin and -version of the dependency. +[revision](https://git-scm.com/docs/revisions) of the dependency, respectively. > Note that `dvc update` updates the `rev_lock` field of the corresponding > DVC-file (when there are changes to bring in). 
diff --git a/public/static/docs/tutorials/pipelines.md b/public/static/docs/tutorials/pipelines.md index 6674879b5a..9638d33830 100644 --- a/public/static/docs/tutorials/pipelines.md +++ b/public/static/docs/tutorials/pipelines.md @@ -5,7 +5,7 @@ Let's explore the natural language processing ([NLP](https://en.wikipedia.org/wiki/Natural_language_processing)) problem of predicting tags for a given StackOverflow question. For example, we want a classifier that can predict posts about the Python language by tagging them -`python`. (This is a short version of the [Tutorial](/doc/tutorials/deep).) +`python`. (This is a short version of the [Deep Tutorial](/doc/tutorials/deep).) In this example, we will focus on building a simple ML [pipeline](/doc/command-reference/pipeline) that takes an archive with diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index fb9d52ef2a..d626610506 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -15,8 +15,8 @@ to build a powerful image classifier using a pretty small dataset. We first train a classifier model using 1000 labeled images, then we double the number of images (2000) and retrain our model. We capture both datasets and -classifier results and show how to use `dvc checkout` along with `git checkout` -to switch between different versions. +classifier results and show how to use `dvc checkout` to switch between model +versions. The specific algorithm used to train and validate the classifier is not important, and no prior knowledge of Keras is required. We'll reuse the @@ -245,7 +245,7 @@ That's it! We have a second model and dataset saved and pointers to them committed with Git. Let's now look at how DVC can help us go back to the previous version if we need to. 
-## Switching between versions +## Switching between model versions The DVC command that helps get a specific committed version of data is designed to be similar to `git checkout`. All we need to do in our case is to @@ -268,9 +268,8 @@ data files, model, all of it. DVC optimizes this operation to avoid copying data or model files each time. So `dvc checkout` is quick even if you have large datasets, data files, or models. -On the other hand, if we want to keep the current version of the code and go -back to the previous dataset only, we can do something like this (make sure that -you don't have uncommitted changes in `data.dvc`): +On the other hand, if we want to keep the current revision of the code, but go +back to the previous dataset version, we can do something like this: ```dvc $ git checkout v1.0 data.dvc @@ -279,7 +278,7 @@ $ dvc checkout data.dvc If you run `git status` you'll see that `data.dvc` is modified and currently points to the `v1.0` of the dataset, while code and model files are from the -`v2.0` version. +`v2.0` [revision](https://git-scm.com/docs/revisions).
@@ -312,8 +311,8 @@ When you have a script that takes some data as an input and produces other data outputs, a better way to capture them is to use `dvc run`: > If you tried the commands in the -> [Switching between versions](#switching-between-versions) section, go back to -> the master branch code and data with: +> [Switching between model versions](#switching-between-model-versions) section, +> go back to the master branch code and data with: > > ```dvc > $ git checkout master diff --git a/public/static/docs/understanding-dvc/collaboration-issues.md b/public/static/docs/understanding-dvc/collaboration-issues.md index ace3d312f1..40c254351f 100644 --- a/public/static/docs/understanding-dvc/collaboration-issues.md +++ b/public/static/docs/understanding-dvc/collaboration-issues.md @@ -14,9 +14,9 @@ formalized. Common questions need to be answered in an unified, principled way. ### Source code and data versioning -- How do you avoid discrepancies between versions of the source code and - versions of the data files when the data cannot fit into a traditional - repository format? +- How do you avoid discrepancies between + [revisions](https://git-scm.com/docs/revisions) of source code and versions of + data files, when the data cannot fit into a traditional repository? ### Experiment time log diff --git a/public/static/docs/understanding-dvc/what-is-dvc.md b/public/static/docs/understanding-dvc/what-is-dvc.md index 4d8ab011b3..d242777d44 100644 --- a/public/static/docs/understanding-dvc/what-is-dvc.md +++ b/public/static/docs/understanding-dvc/what-is-dvc.md @@ -18,15 +18,17 @@ branch or commit. DVC uses a few core concepts: -- **Experiment**: Equivalent to a Git repository version. Each experiment - (extract new features, change model hyperparameters, data cleaning, add a new - data source) should be performed in a separate branch and then merged into the +- **Experiment**: Equivalent to a + [Git revision](https://git-scm.com/docs/revisions). 
Each experiment (extract + new features, change model hyperparameters, data cleaning, add a new data + source) should be performed in a separate branch and then merged into the master branch only if the experiment is successful. DVC allows experiments to be integrated into a Git repository history and NEVER needs to recompute the results after a successful merge. - **Experiment state** or state: Equivalent to a Git snapshot (all committed - files). Git checksum, branch name, or tag can be used as a reference to a + files). A Git commit hash, branch or tag name, etc. can be used as a + [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an experiment state. - **Reproducibility**: Action to reproduce an experiment state. This action diff --git a/public/static/docs/use-cases/data-registries.md b/public/static/docs/use-cases/data-registries.md index a2cb55f8a5..5fd63e777b 100644 --- a/public/static/docs/use-cases/data-registries.md +++ b/public/static/docs/use-cases/data-registries.md @@ -109,7 +109,7 @@ This downloads `music/songs/` from the project's current working directory (anywhere in the file system with user write access). > Note that this command (as well as `dvc import`) has a `--rev` option to -> download specific versions of the data. +> download specific revision of the data. ### Import workflow @@ -137,7 +137,8 @@ $ dvc update dataset.dvc ``` `dvc update` downloads new and changed files, or removes deleted ones, from -`images/faces/`, based on the latest version of the source project. It also +`images/faces/`, based on the latest +[revision](https://git-scm.com/docs/revisions) of the source project. It also updates the project dependency metadata in the import stage (DVC-file). 
### Programatic reusability of DVC data diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index f9a19fa8da..16e9957564 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -84,8 +84,8 @@ full workspace checkout, or checkout of a specific data or model file. Let's consider the full checkout first. It's quite straightforward: > `v1.0` below is a Git tag that should be created in advance to identify the -> dataset version you are interested in. Any Git version (for example `HEAD^` or -> a commit hash) can be used instead. +> dataset version you are interested in. Any revision (for example `HEAD^` or a +> commit hash) can be used instead. ```dvc $ git checkout v1.0 @@ -108,8 +108,8 @@ $ dvc checkout data.dvc ``` If you run `git status` you will see that `data.dvc` is modified and currently -points to the version `v1.0` of the dataset. Meanwhile, code and model files are -their latest versions. +points to the `v1.0` [revision](https://git-scm.com/docs/revisions) of the +repository. Meanwhile, code and model files are their latest versions. ![](/static/img/versioning.png) diff --git a/public/static/docs/user-guide/analytics.md b/public/static/docs/user-guide/analytics.md index a15888f431..2519f19f11 100644 --- a/public/static/docs/user-guide/analytics.md +++ b/public/static/docs/user-guide/analytics.md @@ -12,8 +12,8 @@ and features based on how, where and when people use DVC. For example: - If reflinks (depends on a file system type) are supported for most users, we can keep cache protected mode off by default (see `dvc unprotect`). -- Collecting the OS version and the way DVC was installed allows us to decide - what versions of OS to prioritize and support. 
+- Collecting OS information and the way DVC was installed allows us to decide + which OS platforms and versions to support and prioritize. - If usage of some command is negligible small it makes us think about issues with a command or documentation. @@ -25,12 +25,11 @@ User and event data have a 14 month retention period. DVC's analytics record the following information per event: -- The DVC version, e.g. `0.22.0` -- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc -- The underlying version control system, e.g. `git` +- The DVC version, e.g. `0.8.0` +- The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc. - Command type, e.g. `CmdDataPull` - Command return code, e.g. `1` -- Way the DVC was installed, e.g. `binary` +- Way the DVC was installed, e.g. `Binary` - A DVC analytics user ID (e.g. `8ca59a29-ddd9-4247-992a-9b4775732aad`), generated by [`uuid`](https://docs.python.org/3/library/uuid.html) diff --git a/public/static/docs/user-guide/contributing/core.md b/public/static/docs/user-guide/contributing/core.md index 381a06f15d..fd28c66811 100644 --- a/public/static/docs/user-guide/contributing/core.md +++ b/public/static/docs/user-guide/contributing/core.md @@ -202,7 +202,7 @@ Install [Node.js](https://nodejs.org/en/download/) and then install and run Azurite: ```dvc -$ npm install -g 'azurite@<3' # Need 2.x version +$ npm install -g 'azurite@<3' $ mkdir azurite $ azurite -s -l azurite -d azurite/debug.log ``` diff --git a/public/static/docs/user-guide/contributing/docs.md b/public/static/docs/user-guide/contributing/docs.md index 2480495eb4..88cb5a7edc 100644 --- a/public/static/docs/user-guide/contributing/docs.md +++ b/public/static/docs/user-guide/contributing/docs.md @@ -14,8 +14,8 @@ To contribute documentation, these are the relevant locations under (`docs/`): [Markdown](https://guides.github.com/features/mastering-markdown/) files of the different pages to render dynamically in the browser. 
- [Images](https://github.com/iterative/dvc.org/tree/master/public/static/img) - (`img/`): Add new images (png, svg, etc.) here. Reference them from the - Markdown files like this: `![](/static/img/reproducibility.png)`. + (`img/`): Add new images (png, svg, etc.) here. Include them in Markdown files + like this: `![](/static/img/.gif)`. - [Sections](https://github.com/iterative/dvc.org/tree/master/public/static/docs/sidebar.json) (`docs/sidebar.json`): Edit it to register a new section for the navigation menu. diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index 7458248dfe..797a7cdd56 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -67,12 +67,10 @@ A dependency entry consists of a pair of fields: - `url`: URL of Git repository with source DVC project - `rev`: Only present when the `--rev` option of `dvc import` is used. - Specific - [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) - (such as a branch name or a tag) used to import the dependency from. - - `rev_lock`: Revision or version (Git commit hash) of the external DVC - repository at the time of importing or updating (with `dvc update`) - the dependency. + Specific [Git revision](https://git-scm.com/docs/revisions) (such as a + branch name or a tag) used to import the dependency from. + - `rev_lock`: Git commit hash of the external DVC repository at + the time of importing or updating (with `dvc update`) the dependency. 
> See the examples in > [External Dependencies](/doc/user-guide/external-dependencies) for more diff --git a/public/static/docs/user-guide/external-dependencies.md b/public/static/docs/user-guide/external-dependencies.md index 36b8a8672a..cbfa7cc8a4 100644 --- a/public/static/docs/user-guide/external-dependencies.md +++ b/public/static/docs/user-guide/external-dependencies.md @@ -183,6 +183,6 @@ outs: ``` The `url` and `rev_lock` subfields under `repo` are used to save the origin and -version of the dependency. +[revision](https://git-scm.com/docs/revisions) of the dependency, respectively.
diff --git a/public/static/docs/user-guide/managing-external-data.md b/public/static/docs/user-guide/managing-external-data.md index 27a330dded..7fba5a10ed 100644 --- a/public/static/docs/user-guide/managing-external-data.md +++ b/public/static/docs/user-guide/managing-external-data.md @@ -26,14 +26,12 @@ DVC will track changes in those files and will reflect so in your pipeline > Note that these are a subset of the remote storage types supported by > `dvc remote`. -In order to specify an external output for a stage file use the usual `-o` and -`-O` options with the `dvc run` command, but with the external path or URL -pointing to your desired files. For cached external outputs (specified using -`-o`) you will need to -[setup an external cache](/doc/command-reference/config#cache) location that -will be used by DVC to store versions of your external file. Non-cached external -outputs (specified using `-O`) do not require an external cache to -be setup. +In order to specify an external output for a stage file, use the usual `-o` or +`-O` options of the `dvc run` command, but with the external path or URL +pointing to the file in question. For cached external outputs +(`-o`) you will need to +[setup an external cache](/doc/command-reference/config#cache) location. +Non-cached external outputs (`-O`) do not require an external cache to be setup. 
> Avoid using the same remote location that you are using for `dvc push`, > `dvc pull`, `dvc fetch` as external cache for your external outputs, because diff --git a/public/static/docs/user-guide/running-dvc-on-windows.md b/public/static/docs/user-guide/running-dvc-on-windows.md index c88f7118f9..5fbd6245cd 100644 --- a/public/static/docs/user-guide/running-dvc-on-windows.md +++ b/public/static/docs/user-guide/running-dvc-on-windows.md @@ -24,8 +24,8 @@ Its also possible to enjoy a full Linux terminal experience with the ## Disable short-file name generation -With NTFS, user may want to disable `8dot3` as per -[this reference]() +With NTFS, users may want to disable `8dot3` as per +[this article](https://support.microsoft.com/en-us/help/121007/how-to-disable-8-3-file-name-creation-on-ntfs-partitions) to disable the short-file name generation. It is important to do so for better performance when the user has over 300K files in a single directory. @@ -51,9 +51,8 @@ guide. ## Avoid directories with large number of files The performance of NTFS degrades while handling large volumes of files in a -directory. -[Here](https://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories) -is the resource for reference. +directory, as explained in +[this issue](https://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories). 
## Enabling paging with `less` From f10cf2b2c0b154a4e66bcac22b6685e55ac0aa79 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 01:05:42 -0600 Subject: [PATCH 08/27] term: review usage of "hash", "commit hash", "SHA", and "MD5" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348856746 --- public/static/docs/changelog/0.18.md | 4 ++-- public/static/docs/command-reference/add.md | 2 +- .../static/docs/command-reference/checkout.md | 2 +- public/static/docs/command-reference/gc.md | 11 +++++----- public/static/docs/command-reference/get.md | 6 +++--- .../static/docs/command-reference/import.md | 20 +++++++++---------- .../docs/command-reference/metrics/diff.md | 2 +- public/static/docs/command-reference/repro.md | 6 ++++-- public/static/docs/command-reference/run.md | 2 +- .../static/docs/command-reference/update.md | 2 +- .../static/docs/command-reference/version.md | 10 +++++----- public/static/docs/get-started/add-files.md | 6 +++--- public/static/docs/get-started/store-data.md | 7 ++++--- public/static/docs/install/pre-release.md | 6 +++--- .../docs/tutorials/deep/define-ml-pipeline.md | 2 +- .../docs/tutorials/deep/reproducibility.md | 4 ++-- public/static/docs/tutorials/pipelines.md | 6 +++--- public/static/docs/tutorials/versioning.md | 5 +++-- .../understanding-dvc/related-technologies.md | 4 ++-- .../docs/understanding-dvc/what-is-dvc.md | 2 +- .../versioning-data-and-model-files.md | 2 +- .../static/docs/user-guide/dvc-file-format.md | 8 ++++---- 22 files changed, 62 insertions(+), 57 deletions(-) diff --git a/public/static/docs/changelog/0.18.md b/public/static/docs/changelog/0.18.md index 984691e3a9..945f644a56 100644 --- a/public/static/docs/changelog/0.18.md +++ b/public/static/docs/changelog/0.18.md @@ -28,8 +28,8 @@ really excited to share the progress with you: - 🙂 **Usability improvements** - DVC interface got more informative and easier to use: - - More heavy operations render dynamic progress bar (e.g. 
hash computation): - ![](/static/img/0.18-progress.gif) + - More heavy operations render dynamic progress bar (e.g. checksum + computation): ![](/static/img/0.18-progress.gif) - Pipeline visualization via command line. Just run `dvc pipeline show` with `ascii` option and a target: ![](/static/img/0.18-pipeline.gif) diff --git a/public/static/docs/command-reference/add.md b/public/static/docs/command-reference/add.md index 45989312a5..c74b6f421e 100644 --- a/public/static/docs/command-reference/add.md +++ b/public/static/docs/command-reference/add.md @@ -197,7 +197,7 @@ Saving information to 'pics.dvc'. There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this directory structure, but the images are all added to the cache. DVC -prints a message about this, mentioning that `md5` values are computed for each +prints a message about this, mentioning that MD5 checksums are computed for each directory. A single `pics.dvc` DVC-file is generated for the top-level directory, and it contains: diff --git a/public/static/docs/command-reference/checkout.md b/public/static/docs/command-reference/checkout.md index af024f859c..c60d4b3b42 100644 --- a/public/static/docs/command-reference/checkout.md +++ b/public/static/docs/command-reference/checkout.md @@ -179,7 +179,7 @@ outs: path: model.pkl ``` -But if you check `model.pkl`, the file hash is still the same: +But if you check `model.pkl`, the file checksum is still the same: ```dvc $ md5 model.pkl diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index ae986ad492..a8b8f1c981 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -33,12 +33,13 @@ restored using `dvc fetch`, as long as they have previously been uploaded with ## Options -- `-a`, `--all-branches` - keep cached objects referenced from the latest commit - across all Git branches. 
It should be used if you want to keep data for the - latest experiment revisions. Especially, if you intend to use `dvc gc -c` this - option is much safer. +- `-a`, `--all-branches` - keep cached objects referenced from the latest + [revisions](https://git-scm.com/docs/revisions) across all Git branches (tip + commits). It should be used if you want to keep data for the latest experiment + revisions. Especially, if you intend to use `dvc gc -c` this option is much + safer. -- `-T`, `--all-tags` - the same as `-a` above but applies to Git tags. It's +- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. It's useful if tags are used to track "checkpoints" of an experiment or project. Note that both options can be combined, for example using the `-aT` flag. diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 4e89020b34..b98bb179c0 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -59,9 +59,9 @@ name. it. - `--rev` - specific [Git revision](https://git-scm.com/docs/revisions) (such as - a branch name, a tag, or a commit hash) of the repository to download the file - or directory from. The tip of the default branch is used by default when this - option is not specified. + a commit SHA hash, or a branch or tag name) of the repository to download the + file or directory from. The tip of the default branch is used by default when + this option is not specified. - `--show-url` - instead of downloading the file or directory, just print the storage location (URL) of the target data. `path` is expected to represent a diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 9c67fa6846..2891815aef 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -76,9 +76,9 @@ data artifact from the source project. it. 
- `--rev` - specific [Git revision](https://git-scm.com/docs/revisions) (such as - a branch name, a tag, or a commit hash) of the repository to download the file - or directory from. The tip of the default branch is used by default when this - option is not specified. + a commit SHA hash, or a branch or tag name) of the repository to download the + file or directory from. The tip of the default branch is used by default when + this option is not specified. > Note that this adds a `rev` field in the import stage that fixes it to this > revision. This can impact the behavior of `dvc update`. (See @@ -159,14 +159,14 @@ deps: ``` If the -[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +[Git-reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) moves (e.g. a branch), you may use `dvc update` to bring the data up to date. -However, for typically static references (e.g. tags), or for commits hashes, in -order to actually "update" an import, it's necessary to **re-import the data** -instead, by using `dvc import` again without or with a different `--rev`. This -will overwrite the import stage (DVC-file), either removing or replacing the -`rev` field, respectively. This can produce an import stage that is able to be -updated normally with `dvc update` going forward. For example: +However, for typically static references (e.g. tags), or for commit SHA hashes, +in order to actually "update" an import, it's necessary to **re-import the +data** instead, by using `dvc import` again without or with a different `--rev`. +This will overwrite the import stage (DVC-file), either removing or replacing +the `rev` field, respectively. This can produce an import stage that is able to +be updated normally with `dvc update` going forward. 
For example: ```dvc $ dvc import --rev master \ diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 2be0e408f1..a486891709 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -28,7 +28,7 @@ The changes shown by this command includes the new value, and numeric difference (delta) from the previous value of metrics. They're calculated between two different [Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -(commit hash, branch or tag name, etc.) for all metrics in the +(commit SHA hash, branch or tag name, etc.) for all metrics in the project, found by examining all of the [DVC-files](/doc/user-guide/dvc-file-format) in both revisions. diff --git a/public/static/docs/command-reference/repro.md b/public/static/docs/command-reference/repro.md index f6bda51b33..ff690a63ce 100644 --- a/public/static/docs/command-reference/repro.md +++ b/public/static/docs/command-reference/repro.md @@ -43,7 +43,8 @@ before running the stages that produce them. `dvc repro` does not run `dvc fetch`, `dvc pull` or `dvc checkout` to get data files, intermediate or final results. It saves all the data files, intermediate or final results into the DVC cache (unless `--no-commit` option is -specified), and updates stage files with the new checksum information. +specified), and updates stage files with the new dependency/output file or +directory checksums. ### Parallel stage execution @@ -239,7 +240,8 @@ Saving information to 'Dvcfile'. ``` You can now check that `Dvcfile` and `count.txt` have been updated with the new -information, new `md5` checksums and a new result respectively. +information and updated dependency/output file checksums, and a new result, +respectively. 
## Example: Downstream diff --git a/public/static/docs/command-reference/run.md b/public/static/docs/command-reference/run.md index 6bf323c083..7e9edc5090 100644 --- a/public/static/docs/command-reference/run.md +++ b/public/static/docs/command-reference/run.md @@ -132,7 +132,7 @@ data pipeline (e.g. random numbers, time functions, hardware dependency, etc.) - `--no-exec` - create a stage file, but do not execute the `command` defined in it, nor take dependencies or outputs under DVC control. In the DVC-file - contents, the `md5` hash sums will be empty; They will be populated the next + contents, the file checksums will be empty; They will be populated the next time this stage is actually executed. This is useful if, for example, you need to build a pipeline (dependency graph) first, and then run it all at once. diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index 2bc68a818a..15c383251f 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -32,7 +32,7 @@ Another detail to note is that when the `--rev` (revision) option of kind of [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this is, for example a branch or a tag. For typically static references (e.g. -tags), or for commits hashes, `dvc update` will not have any effect on the +tags), or for commit SHA hashes, `dvc update` will not have any effect on the import. Refer to the [re-importing example](/doc/command-reference/import#example-fixed-revisions-re-importing) to learn how to "update" fixed-revision imports. 
diff --git a/public/static/docs/command-reference/version.md b/public/static/docs/command-reference/version.md index b1f3c8fabe..c3587a31a3 100644 --- a/public/static/docs/command-reference/version.md +++ b/public/static/docs/command-reference/version.md @@ -16,7 +16,7 @@ system/environment: | Line | Detail | | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | +| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit SHA hash in case of a development version) | | `Python version` | Version of the Python being used on the environment in which DVC is initialized | | `Platform` | Information about the operating system of the machine | | [`Binary`](#what-we-mean-by-binary) | Shows whether DVC was installed from a package or from a binary release | @@ -53,10 +53,10 @@ The detail of DVC version depends upon the way of installing DVC. that might not be ready to publish yet. Therefore installing using the above command might have issues regarding its usage. So to trace any error reported with this setup, we need to know exactly which version is being used. For this - we rely on a git commit hash that is displayed in this command's output like - this: `0.40.2+292cab.mod`. The part before `+` is the `_BASE_VERSION`, and the - following part is the latest `master` branch commit hash. The optional suffix - `.mod` means that code is modified. + we rely on a Git commit SHA hash, that is displayed in this command's output + like this: `0.40.2+292cab.mod`. The part before `+` is the `_BASE_VERSION`, + and the following part is the SHA of the tip of the `master` branch. The + optional suffix `.mod` means that code is modified. 
### What we mean by "Binary" diff --git a/public/static/docs/get-started/add-files.md b/public/static/docs/get-started/add-files.md index 63f7521b9c..54d4bc05bd 100644 --- a/public/static/docs/get-started/add-files.md +++ b/public/static/docs/get-started/add-files.md @@ -52,9 +52,9 @@ $ ls -R .dvc/cache 04afb96060aad90176268345e10355 ``` -`a304afb96060aad90176268345e10355` from above is an MD5 hash of the `data.xml` -file we just added to DVC. And if you check the `data/data.xml.dvc` DVC-file you -will see that it has this hash inside. +`a304afb96060aad90176268345e10355` above is the file checksum of the `data.xml` +file we just added to DVC. If you check the `data/data.xml.dvc` DVC-file, you +will see that it has this string inside. ### Important note on cache performance diff --git a/public/static/docs/get-started/store-data.md b/public/static/docs/get-started/store-data.md index 7cf79cb40c..4ce4ac2663 100644 --- a/public/static/docs/get-started/store-data.md +++ b/public/static/docs/get-started/store-data.md @@ -35,8 +35,9 @@ $ ls -R /tmp/dvc-storage 04afb96060aad90176268345e10355 ``` -where `a304afb96060aad90176268345e10355` is an MD5 hash of the `data.xml` file, -and if you check the `data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format) -you will see that it has this hash inside. +`a304afb96060aad90176268345e10355` above is the file checksum of the `data.xml` +file. If you check the `data.xml.dvc` +[DVC-file](/doc/user-guide/dvc-file-format), you will see that it has this +string inside. diff --git a/public/static/docs/install/pre-release.md b/public/static/docs/install/pre-release.md index 1f31b1353a..887011b0c5 100644 --- a/public/static/docs/install/pre-release.md +++ b/public/static/docs/install/pre-release.md @@ -15,9 +15,9 @@ $ pip install git+https://github.com/iterative/dvc ``` > `gitpython` allows the installation process to generate a DVC version using -> the current Git commit hash. This lets us to distinguish official DVC releases -> (e.g. 
`0.64.3`) from a development version (e.g. `0.64.3-9c7381`). For more -> information on our versioning convention, refer to +> the current Git commit SHA hash. This lets us to distinguish official DVC +> releases (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). +> For more information on our versioning convention, refer to > [Components of DVC version](/doc/command-reference/version#components-of-dvc-version). To install a development version for contributing to the project, please refer diff --git a/public/static/docs/tutorials/deep/define-ml-pipeline.md b/public/static/docs/tutorials/deep/define-ml-pipeline.md index f74bc374da..198d0fb5dc 100644 --- a/public/static/docs/tutorials/deep/define-ml-pipeline.md +++ b/public/static/docs/tutorials/deep/define-ml-pipeline.md @@ -69,7 +69,7 @@ need to run `dvc unprotect` or `dvc remove` first (see the If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by `dvc add`, you will see that outputs are tracked in the `outs` field. In this file, only one output is specified. The output contains the data -file path in the repository and its MD5 checksum. This checksum determines a +file path in the repository and its MD5 checksum. This checksum determines the location of the actual content file in the [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), `.dvc/cache`. diff --git a/public/static/docs/tutorials/deep/reproducibility.md b/public/static/docs/tutorials/deep/reproducibility.md index 25d1e7024f..f8c19c8909 100644 --- a/public/static/docs/tutorials/deep/reproducibility.md +++ b/public/static/docs/tutorials/deep/reproducibility.md @@ -116,7 +116,7 @@ master: Let's keep the result in the repository. Later we can find out why bigrams don't add value to the current model and change that. -Many DVC-files were changed. This happened due to MD5 checksum changes. +Many DVC-files were changed. This happened due to file checksum changes. 
```dvc $ git status -s @@ -233,7 +233,7 @@ CONFLICT (content): Merge conflict in Dvcfile Automatic merge failed; fix conflicts and then commit the result. ``` -The merge has a few conflicts. All of the conflicts are related to MD5 checksum +The merge has a few conflicts. All of the conflicts are related to file checksum mismatches in the branches. You can properly merge conflicts by prioritizing the checksums from the bigrams branch: that is, by removing all checksums of the other branch. diff --git a/public/static/docs/tutorials/pipelines.md b/public/static/docs/tutorials/pipelines.md index 9638d33830..1038629952 100644 --- a/public/static/docs/tutorials/pipelines.md +++ b/public/static/docs/tutorials/pipelines.md @@ -182,7 +182,7 @@ outs: ``` Just like the DVC-file we created earlier with `dvc add`, this stage file uses -checksums that point to the cache to describe and version control dependencies +checksums that point to the cache, to describe and version control dependencies and outputs. Output `data/Posts.xml` file is saved as `.dvc/cache/a3/04afb96060aad90176268345e10355` and linked (or copied) to the workspace, as well as added to `.gitignore`. @@ -193,8 +193,8 @@ stages) we need to apply. This is important when you run `dvc repro` to regenerate the final or intermediate result. Second, hopefully it's clear by now that the actual data is stored in the -`.dvc/cache` directory, each file having a name based on an MD5 hash. This cache -is similar to Git's +`.dvc/cache` directory, each file having a name based on an `md5` checksum. This +cache is similar to Git's [objects database](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), but made specifically to handle large data files. 
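The relationship between a checksum and its cache location, as seen in the `.dvc/cache/a3/04afb96060aad90176268345e10355` example above, can be sketched in Python. This is only an illustration under assumptions: the helper names are hypothetical, and `md5_of_file` hashes the raw file bytes, which may differ in detail from what DVC does internally.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file's raw bytes, read in chunks.

    Assumption: plain MD5 over the bytes; DVC's own hashing may differ.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(checksum, cache_dir=".dvc/cache"):
    """Build a cache location: the first two hex characters name the subdirectory."""
    return "/".join([cache_dir, checksum[:2], checksum[2:]])


print(cache_path("a304afb96060aad90176268345e10355"))
# .dvc/cache/a3/04afb96060aad90176268345e10355
```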
diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index d626610506..6dd6e475a8 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -287,8 +287,9 @@ points to the `v1.0` of the dataset, while code and model files are from the As we have learned already, DVC keeps data files out of Git (by adjusting `.gitignore`) and puts them into the cache (usually it's a `.dvc/cache` directory inside the repository). Instead, DVC creates -[DVC-files](/doc/user-guide/dvc-file-format). These text files serve as pointers -(MD5 hash) to the cache and are version controlled by Git. +[DVC-files](/doc/user-guide/dvc-file-format). These text files serve as data +placeholders that point to the cached files, and they can be easily version +controlled with Git. When we run `git checkout` we restore pointers (DVC-files) first, then when we run `dvc checkout` we use these pointers to put the right data in the right diff --git a/public/static/docs/understanding-dvc/related-technologies.md b/public/static/docs/understanding-dvc/related-technologies.md index bf5dffa5bf..c194981fdb 100644 --- a/public/static/docs/understanding-dvc/related-technologies.md +++ b/public/static/docs/understanding-dvc/related-technologies.md @@ -77,8 +77,8 @@ http://studio.ml/ - File tracking: - - DVC tracks files based on checksum (MD5) instead of file timestamps. This - helps avoid running into heavy processes like model retraining when you + - DVC tracks files based on their checksum (MD5) instead of file timestamps. + This helps avoid running into heavy processes like model retraining when you checkout a previous, trained version of a model's code (Make would retrain the model). 
diff --git a/public/static/docs/understanding-dvc/what-is-dvc.md b/public/static/docs/understanding-dvc/what-is-dvc.md index d242777d44..94e5ebf030 100644 --- a/public/static/docs/understanding-dvc/what-is-dvc.md +++ b/public/static/docs/understanding-dvc/what-is-dvc.md @@ -27,7 +27,7 @@ DVC uses a few core concepts: results after a successful merge. - **Experiment state** or state: Equivalent to a Git snapshot (all committed - files). A Git commit hash, branch or tag name, etc. can be used as a + files). A Git commit SHA hash, branch or tag name, etc. can be used as a [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an experiment state. diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index 16e9957564..5470a0b8c8 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -85,7 +85,7 @@ file. Let's consider the full checkout first. It's quite straightforward: > `v1.0` below is a Git tag that should be created in advance to identify the > dataset version you are interested in. Any revision (for example `HEAD^` or a -> commit hash) can be used instead. +> commit SHA hash) can be used instead. ```dvc $ git checkout v1.0 diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index 797a7cdd56..ea10adb3c4 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -69,8 +69,8 @@ A dependency entry consists of a pair of fields: - `rev`: Only present when the `--rev` option of `dvc import` is used. Specific [Git revision](https://git-scm.com/docs/revisions) (such as a branch name or a tag) used to import the dependency from. 
- - `rev_lock`: Git commit hash of the external DVC repository at - the time of importing or updating (with `dvc update`) the dependency. + - `rev_lock`: Git commit SHA hash of the external DVC repository + at the time of importing or updating (with `dvc update`) the dependency. > See the examples in > [External Dependencies](/doc/user-guide/external-dependencies) for more @@ -92,8 +92,8 @@ A metric entry consists of these fields: A `meta` entry consists of `key: value` pairs such as `name: John`. A meta entry can have any valid YAML structure containing any number of attributes. -`"meta: string"` is also possible, it doesn't need to contain a hash (a.k.a. -dictionary) structure always. +`"meta: string"` is also possible, it doesn't need to contain a _hash_ structure +(a.k.a. dictionary) always. Comments can be added to the DVC-file using `# comment` syntax. Comments and meta values are preserved between multiple executions of `dvc repro` and From 1d140862d4154301422b5ebcea461764b05bf7be Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 12:54:47 -0600 Subject: [PATCH 09/27] term: rewrite definition of "workspace" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348857535 --- public/static/docs/glossary.js | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/public/static/docs/glossary.js b/public/static/docs/glossary.js index 38297685a2..48bc08b5ef 100644 --- a/public/static/docs/glossary.js +++ b/public/static/docs/glossary.js @@ -8,17 +8,12 @@ export default { name: 'Workspace', match: ['workspace'], desc: ` -Directory containing all your project files. For example raw datasets, source -code, ML models, etc. A workspace becomes a **DVC project** when -[\`dvc init\`](/doc/command-reference/init) is run, and -[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it. - -Includes the +Collection of all your project files e.g. 
raw datasets, source code, ML models, +etc – typically in a single directory. +[External outputs](/doc/user-guide/managing-external-data) also form part of +your (expanded) workspace. This includes the [working tree](https://git-scm.com/docs/gitglossary#def_working_tree) (\`HEAD\` -plus local changes) for Git repositories. - -Note that [external outputs](/doc/user-guide/managing-external-data) also -form part of your expanded workspace, technically. ++ local changes) when using Git. ` }, { From e55c3626aefb643d930e971febc24f7ddff31f47 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 18:50:12 -0600 Subject: [PATCH 10/27] cmd ref: change link from `metrics diff` options to `metrics show` per https://github.com/iterative/dvc.org/pull/933#issuecomment-580033273 --- public/static/docs/command-reference/metrics/diff.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index a486891709..7165832d7e 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -48,7 +48,7 @@ shows changes for all metric files in it. - `-t`, `--type` - specify a type of the metric file. Accepted values are: `raw` (default), `json`, `tsv`, `htsv`, `csv`, `hcsv`. It will be used to determine - how to parse and format metics for display. See `dvc metrics modify` for more + how to parse and format metrics for display. See `dvc metrics show` for more details. This option will override `type` and `xpath` defined in the corresponding @@ -59,7 +59,7 @@ shows changes for all metric files in it. specific metric value only. Should be used if the metric file contains multiple numbers and you want to use only one of them. Only a single path is allowed. It will override `xpath` defined in the corresponding DVC-file. See - `dvc metrics modify` for more details. + `dvc metrics show` for more details.
- `--show-json` - prints the command's output in easily parsable JSON format, instead of a human-readable table. From f9895861df9d96e07e7b9f71d5e229af04de8ea1 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 20:16:37 -0600 Subject: [PATCH 11/27] cmd ref: update example in `dvc metrics diff` and similar ones per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-348851587 --- .../static/docs/command-reference/checkout.md | 57 ++++++++----------- public/static/docs/command-reference/fetch.md | 6 +- public/static/docs/command-reference/get.md | 5 +- .../static/docs/command-reference/install.md | 4 +- .../docs/command-reference/metrics/diff.md | 52 +++++++++++++---- public/static/docs/tutorials/versioning.md | 2 +- 6 files changed, 71 insertions(+), 55 deletions(-) diff --git a/public/static/docs/command-reference/checkout.md b/public/static/docs/command-reference/checkout.md index c60d4b3b42..4853ad4e12 100644 --- a/public/static/docs/command-reference/checkout.md +++ b/public/static/docs/command-reference/checkout.md @@ -100,10 +100,9 @@ be pulled from remote storage using `dvc pull`. ## Examples Let's employ a simple workspace with some data, code, ML models, -pipeline stages, as well as a few Git tags, such as our -[get started example repo](https://github.com/iterative/example-get-started). -Then we can see what happens with `git checkout` and `dvc checkout` as we switch -from tag to tag. +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`git checkout` and `dvc checkout` as we switch from tag to tag.
@@ -118,8 +117,7 @@ $ cd example-get-started
-The workspace looks almost like in this -[pipeline setup](/doc/tutorials/pipelines): +The workspace looks like this: ```dvc . @@ -133,15 +131,6 @@ The workspace looks almost like in this └── ``` -We have these tags in the repository that represent different iterations of -solving the problem: - -```dvc -$ git tag -baseline-experiment <- first simple version of the model -bigrams-experiment <- use bigrams to improve the model -``` - This project comes with a predefined HTTP [remote storage](/doc/command-reference/remote). We can now just run `dvc pull` that will fetch and checkout the most recent `model.pkl`, `data.xml`, and other @@ -152,56 +141,56 @@ files that are under DVC control. The model file checksum ```dvc $ dvc pull ... -Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33' +Checking out model.pkl with cache '662eb7f64216d9c2c1088d0a5e2c6951' ... $ md5 model.pkl -MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33 +MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951 ``` -What if we want to rewind history, so to speak? The `git checkout` command lets -us checkout at any point in the commit history, or even checkout other tags. It +What if we want to "rewind history", so to speak? The `git checkout` command +lets us restore any point in the repository history, including any tags. It automatically adjusts the files, by replacing file content and adding or deleting files as necessary. ```dvc -$ git checkout baseline -Note: checking out 'baseline'. +$ git checkout 7-train # Tag to stage where model is created +Note: checking out '7-train'. ... -HEAD is now at 40cc182... +HEAD is now at 2df4172... 
``` Let's check the `model.pkl` entry in `train.dvc` now: ```yaml outs: - md5: a66489653d1b6a8ba989799367b32c43 - path: model.pkl + - md5: 43630cce66a2432dcecddc9dd006d0a7 + path: model.pkl ``` But if you check `model.pkl`, the file checksum is still the same: ```dvc $ md5 model.pkl -MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33 +MD5 (model.pkl) = 662eb7f64216d9c2c1088d0a5e2c6951 ``` This is because `git checkout` changed `featurize.dvc`, `train.dvc`, and other DVC-files. But it did nothing with the `model.pkl` and `matrix.pkl` files. Git -doesn't track those files, DVC does, so we must do this: +doesn't track those files; DVC does, so we must do this: ```dvc $ dvc fetch $ dvc checkout $ md5 model.pkl -MD5 (model.pkl) = a66489653d1b6a8ba989799367b32c43 +MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7 ``` -What happened is that DVC went through the sole existing DVC-file and adjusted -the current set of files to match the `outs` of that stage. `dvc fetch` is run +What happened is that DVC went through the sole project DVC-files and adjusted +the current set of files to match the `outs` in them. `dvc fetch` is run this once to download missing data from the remote storage to the cache. -Alternatively, we could have just run `dvc pull` in this case to automatically -do `dvc fetch` + `dvc checkout`. +(Alternatively, we could have just run `dvc pull` to do `dvc fetch` + +`dvc checkout` in one step.) ## Automating `dvc checkout` @@ -222,9 +211,9 @@ running `dvc checkout` when needed. We can then checkout the master branch again: ```dvc -$ git checkout bigrams -Previous HEAD position was d171a12 add evaluation stage -HEAD is now at d092b42 try using bigrams +$ git checkout 9-bigrams-model # Bigrams version of the model +Previous HEAD position was dd2cc99 Create evaluation stage +HEAD is now at 72e0f12 try using 9-bigrams-model Checking out model.pkl with cache '3863d0e317dee0a55c4e59d2ec0eef33'. 
$ md5 model.pkl diff --git a/public/static/docs/command-reference/fetch.md b/public/static/docs/command-reference/fetch.md index 8e8aeca7c1..8c8e06ade3 100644 --- a/public/static/docs/command-reference/fetch.md +++ b/public/static/docs/command-reference/fetch.md @@ -115,9 +115,9 @@ specified in DVC-files currently in the project are considered by `dvc fetch` ## Examples Let's employ a simple workspace with some data, code, ML models, -pipeline stages, as well as a few Git tags, such as our -[get started example repo](https://github.com/iterative/example-get-started). -Then we can see what happens with `dvc fetch` as we switch from tag to tag. +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`dvc fetch` as we switch from tag to tag.
diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index b98bb179c0..c9378a64a9 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -121,7 +121,7 @@ $ ls install.sh ``` -### Example: Getting the storage URL of a DVC-tracked file +## Example: Getting the storage URL of a DVC-tracked file We can use `dvc get --show-url` to get the actual location where the final model file from our @@ -129,7 +129,8 @@ file from our stored: ```dvc -$ dvc get https://github.com/iterative/example-get-started model.pkl --show-url +$ dvc get --show-url \ + https://github.com/iterative/example-get-started model.pkl https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 ``` diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index 0f2e702cea..d13e4f795b 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -131,7 +131,6 @@ Let's first list the available tags in the _Get Started_ project: ```dvc $ git tag - 0-empty 1-initialize 2-remote @@ -158,7 +157,6 @@ Note: checking out '6-featurization'. You are in 'detached HEAD' state... 
$ dvc status - featurize.dvc: changed outs: modified: data/features @@ -241,7 +239,7 @@ If we simply edit one of the code files: ```dvc $ vi src/featurization.py -$ git commit -a -m "modified featurization" +$ git commit -a -m "Modified featurization" featurize.dvc: changed deps: diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 7165832d7e..6e41875528 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -25,8 +25,8 @@ positional arguments: ## Description The changes shown by this command includes the new value, and numeric difference -(delta) from the previous value of metrics. They're calculated between two -different +(delta) from the previous value of metrics (with 3-digit accuracy). They're +calculated between two different [Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (commit SHA hash, branch or tag name, etc.) for all metrics in the project, found by examining all of the @@ -73,25 +73,53 @@ shows changes for all metric files in it. ## Examples -Let's create a metrics file using a dummy command and commit it with Git: +Let's employ a simple workspace with some data, code, ML models, +pipeline stages, such as the DVC project created in our +[Get Started](/doc/get-started) section. Then we can see what happens with +`dvc metrics diff` in different situations. +
+ +### Click and expand to setup the project + +Start by cloning our example repo if you don't already have it: + +```dvc +$ git clone https://github.com/iterative/example-get-started +$ cd example-get-started ``` -$ dvc run -M metrics.json 'echo "{\"AUC\": 0.5}" > metrics.json' -$ git commit -a -m "add metrics" + +
+ +Notice that we have an `auc.metric` metric file: + +``` +$ cat auc.metric +0.602818 ``` -Now let's say we've adjusted our scripts and our AUC has changed: +Now let's mock a change in our AUC metric: ``` -$ dvc run -M metrics.json 'echo "{\"AUC\": 0.6}" > metrics.json' +$ echo '0.5' > auc.metric ``` -To see the change, let's run `dvc metrics diff` without arguments. This compares -our current workspace metrics to what we had in the previous -commit: +To see the change, let's run `dvc metrics diff`. This compares our current +workspace (including uncommitted local changes) metrics to what we +had in the previous commit: ``` +$ git diff +--- a/auc.metric ++++ b/auc.metric +@@ -1 +1 @@ +-0.602818 ++0.5 + $ dvc metrics diff - Path Metric Value Change -metrics.json AUC 0.600 0.100 + Path Metric Value Change +auc.metric 0.500 -0.103 ``` + +> Note that metric files are typically versioned with Git, so we can use both +> `git diff` and `dvc metrics diff` to understand their changes, as seen above. diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index 6dd6e475a8..a8f04750ba 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -374,7 +374,7 @@ hands-on experience with pipelines, and try to apply it here. Don't hesitate to join our [community](/chat) and ask any questions! Another detail we only brushed upon here is the way we captured the -`metrics.csv` metrics file with the `-M` option of `dvc run`. Marking this +`metrics.csv` metric file with the `-M` option of `dvc run`. Marking this output as a metric enables us to compare its values across Git tags or branches (for example, representing different experiments). 
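The `dvc metrics diff` example earlier in this patch reports a value of `0.500` and a change of `-0.103` for `auc.metric`. A minimal Python sketch of that arithmetic follows, assuming the change is simply the new value minus the old one, rounded to three decimals. The function name `metric_change` is hypothetical and this is not DVC's implementation.

```python
def metric_change(old, new, digits=3):
    """Return the difference between two metric values, rounded for display.

    Hypothetical helper; assumes delta = new - old with 3-digit rounding.
    """
    return round(new - old, digits)


old_auc = 0.602818  # value committed in auc.metric
new_auc = 0.5       # mocked value in the workspace
print(f"{new_auc:.3f} {metric_change(old_auc, new_auc):.3f}")
# 0.500 -0.103
```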
See `dvc metrics` and [Compare Experiments](/doc/get-started/compare-experiments) to learn more From 734994a5cb9232cf7e955ec248fe6a1b30fbf276 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 20:50:21 -0600 Subject: [PATCH 12/27] cmd ref: simplify dvc gc -a option per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540249 and https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540611 --- public/static/docs/command-reference/gc.md | 8 +++----- public/static/docs/command-reference/pull.md | 4 ++-- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index a8b8f1c981..c148a7d143 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -33,11 +33,9 @@ restored using `dvc fetch`, as long as they have previously been uploaded with ## Options -- `-a`, `--all-branches` - keep cached objects referenced from the latest - [revisions](https://git-scm.com/docs/revisions) across all Git branches (tip - commits). It should be used if you want to keep data for the latest experiment - revisions. Especially, if you intend to use `dvc gc -c` this option is much - safer. +- `-a`, `--all-branches` - keep cached objects referenced in all Git branches. + Useful for keeping data for all the latest experiment versions. This option is + a safer alternative to `dvc gc -c`. - `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. It's useful if tags are used to track "checkpoints" of an experiment or project. diff --git a/public/static/docs/command-reference/pull.md b/public/static/docs/command-reference/pull.md index fcc2bc3891..b6cbc2ed72 100644 --- a/public/static/docs/command-reference/pull.md +++ b/public/static/docs/command-reference/pull.md @@ -65,8 +65,8 @@ reflinks or hardlinks to put it in the workspace without copying. 
See (configured with the `core.config` config option) is used. - `-a`, `--all-branches` - determines the files to download by examining - DVC-files in all Git branches of the project repository (if using Git). It's - useful if branches are used to track experiments or project checkpoints. + DVC-files in all Git branches of the repository. It's useful if branches are used + to track experiments or project checkpoints. - `-T`, `--all-tags` - the same as `-a`, `--all-branches` but Git tags are used to save different experiments or project checkpoints. Note that both options From 961b51357cae6bd7d2dd4a5ccbfdba6dd3ee9102 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 23:27:42 -0600 Subject: [PATCH 13/27] cmd ref: use "reference" more than "revision" in diff per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350539579 --- public/static/docs/command-reference/diff.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index 7a14e8b189..c7280d85c8 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -26,7 +26,7 @@ a summary of basic statistics: how many files were deleted/changed, and the file size differences. `a_ref` is required, while `b_ref` defaults to `HEAD`. Note that `dvc diff` does not show the line-to-line comparison among the target -files in each revision, like `git diff` does. +files in each reference, like `git diff` does. > For an example on how to create line-to-line text file comparison, refer to > [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256) @@ -80,12 +80,12 @@ Preparing to download data from 'https://remote.dvc.org/get-started' ## Example: Previous version of the same branch -The minimal `dvc diff` command only includes the "from" revision (`a_ref`) from -which to calculate the difference.
The "until" revision (`b_ref`) defaults to -`HEAD` (currently checked out Git revision). +The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from +which to calculate the difference. The "until" reference (`b_ref`) defaults to +`HEAD` (current Git revision). -To find the general differences with the very previous revision of the project, -we can use `HEAD^` as `a_ref`: +To see the difference with the very previous revision of the project, we can use +`HEAD^` as `a_ref`: ```dvc $ dvc diff HEAD^ @@ -127,7 +127,7 @@ example repo.
-To see the difference in `model.pkl` among these revisions, we can run the +To see the difference in `model.pkl` among these references, we can run the following command. ```dvc @@ -141,7 +141,7 @@ diff for 'model.pkl' ``` The output from this command confirms that there's a difference in the -`model.pkl` file between the 2 Git revisions (tags `baseline-experiment` and +`model.pkl` file between the 2 Git references (tags `baseline-experiment` and `bigrams-experiment`) we indicated. ### What about directories? @@ -191,5 +191,5 @@ diff for 'data/prepared' The command above checks whether there have been any changes to the `data/prepared` directory after the `5-preparation` revision (since the `b_ref` -is the current revision, `HEAD` by default). The output tells us that there have -been no changes to that directory (or to any other file). +is `HEAD` by default). The output tells us that there have been no changes to +that directory (or to any other file). From 6b259ba5b15d965b1da83655a284a3acad33c031 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 23:31:47 -0600 Subject: [PATCH 14/27] cmd ref: link term "revision" in diff and `metrics diff` also per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350539579 --- public/static/docs/command-reference/diff.md | 5 +++-- public/static/docs/command-reference/metrics/diff.md | 6 +++--- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index c7280d85c8..25ef8b8ad0 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -1,7 +1,8 @@ # diff -Show changes between revisions of the DVC repository. It can be -narrowed down to specific target files and directories under DVC control. +Show changes between DVC repository +[revisions](https://git-scm.com/docs/revisions). 
It can be narrowed down to +specific target files and directories under DVC control. > This command requires that the project is a > [Git](https://git-scm.com/) repository. diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 6e41875528..d68bf8d33a 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -2,7 +2,7 @@ Show a table of changes between [metrics](/doc/command-reference/metrics#description) among DVC -repository revisions. +repository [revisions](https://git-scm.com/docs/revisions). > This command requires that the project is a > [Git](https://git-scm.com/) repository. @@ -30,7 +30,7 @@ calculated between two different [Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) (commit SHA hash, branch or tag name, etc.) for all metrics in the project, found by examining all of the -[DVC-files](/doc/user-guide/dvc-file-format) in both revisions. +[DVC-files](/doc/user-guide/dvc-file-format) in both references. The metrics to use in this command can be limited with the `--targets` option. target can also be directories (with the `-R` option), so that DVC recursively @@ -40,7 +40,7 @@ shows changes for all metric files in it. - `--targets` - specific metric files or directories to calculate metrics differences for. If omitted (default), this command uses all metric files - found in both Git revisions. + found in both Git references. - `-R`, `--recursive` - determines the metric files to use by searching each target directory and its subdirectories for DVC-files to inspect. 
`targets` is From c006d18a5389f1a09c055a83339cc627b52aa34f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 23:49:12 -0600 Subject: [PATCH 15/27] term: put Git ref examples before term and link per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540611 --- public/static/docs/command-reference/diff.md | 10 +++++----- public/static/docs/command-reference/get.md | 8 ++++---- public/static/docs/command-reference/import.md | 8 ++++---- public/static/docs/command-reference/install.md | 4 ++-- public/static/docs/command-reference/metrics/diff.md | 7 +++---- public/static/docs/command-reference/update.md | 6 +++--- public/static/docs/use-cases/data-registries.md | 2 +- public/static/docs/user-guide/dvc-file-format.md | 5 +++-- 8 files changed, 25 insertions(+), 25 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index 25ef8b8ad0..9ef171384c 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -20,11 +20,11 @@ positional arguments: ## Description -Given two -[Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -(commit hash, branch or tag name, etc.) `a_ref` and `b_ref`, this command shows -a summary of basic statistics: how many files were deleted/changed, and the file -size differences. `a_ref` is required, while `b_ref` defaults to `HEAD`. +Given two commit SHA hashes, branch or tag names, etc. +([Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References)) +`a_ref` and `b_ref`, this command shows a summary of basic statistics: how many +files were deleted/changed, and the file size differences. `a_ref` is required, +while `b_ref` defaults to `HEAD`. Note that `dvc diff` does not show the line-to-line comparison among the target files in each reference, like `git diff` does. 
diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index c9378a64a9..6e399efd90 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -58,10 +58,10 @@ name. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - specific [Git revision](https://git-scm.com/docs/revisions) (such as - a commit SHA hash, or a branch or tag name) of the repository to download the - file or directory from. The tip of the default branch is used by default when - this option is not specified. +- `--rev` - specific commit SHA hash, branch or tag name, etc. (any + [Git revision](https://git-scm.com/docs/revisions)) of the repository to + download the file or directory from. The tip of the default branch is used by + default when this option is not specified. - `--show-url` - instead of downloading the file or directory, just print the storage location (URL) of the target data. `path` is expected to represent a diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 2891815aef..ec59fbebe1 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -75,10 +75,10 @@ data artifact from the source project. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - specific [Git revision](https://git-scm.com/docs/revisions) (such as - a commit SHA hash, or a branch or tag name) of the repository to download the - file or directory from. The tip of the default branch is used by default when - this option is not specified. +- `--rev` - specific commit SHA hash, branch or tag name, etc. (any + [Git revision](https://git-scm.com/docs/revisions)) of the repository to + download the file or directory from. The tip of the default branch is used by + default when this option is not specified. 
> Note that this adds a `rev` field in the import stage that fixes it to this > revision. This can impact the behavior of `dvc update`. (See diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index d13e4f795b..8e0311ead6 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -18,8 +18,8 @@ automatically. Namely: -**Checkout**: For any given branch or tag, `git checkout` retrieves the -[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that +**Checkout**: For any commit SHA hash, branch or tag name, etc. `git checkout` +retrieves the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that [Git revision](https://git-scm.com/docs/revisions). The project's DVC-files in turn refer to data stored in cache, but not necessarily in the workspace. Normally, it would be necessary to diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index d68bf8d33a..06e27b41b9 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -26,10 +26,9 @@ positional arguments: The changes shown by this command includes the new value, and numeric difference (delta) from the previous value of metrics (with 3-digit accuracy). They're -calculated between two different -[Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -(commit SHA hash, branch or tag name, etc.) for all metrics in the -project, found by examining all of the +calculated between two commit SHA hashes, branch or tag names, etc. +([Git references](https://git-scm.com/book/en/v2/Git-Internals-Git-References)) +for all metrics in the project, found by examining all of the [DVC-files](/doc/user-guide/dvc-file-format) in both references. The metrics to use in this command can be limited with the `--targets` option. 
diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index 15c383251f..78be72ad65 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -27,9 +27,9 @@ Note that import stages are considered always locked, meaning that if you run `dvc repro`, they won't be updated. `dvc update` is the only command that can update them. -Another detail to note is that when the `--rev` (revision) option of -`dvc import` has been used to create an import stage, DVC is not aware of what -kind of +Another detail to note is that when the `--rev` +([revision](https://git-scm.com/docs/revisions)) option of `dvc import` has been +used to create an import stage, DVC is not aware of what kind of [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) this is, for example a branch or a tag. For typically static references (e.g. tags), or for commit SHA hashes, `dvc update` will not have any effect on the diff --git a/public/static/docs/use-cases/data-registries.md b/public/static/docs/use-cases/data-registries.md index 5fd63e777b..76631578d6 100644 --- a/public/static/docs/use-cases/data-registries.md +++ b/public/static/docs/use-cases/data-registries.md @@ -109,7 +109,7 @@ This downloads `music/songs/` from the project's current working directory (anywhere in the file system with user write access). > Note that this command (as well as `dvc import`) has a `--rev` option to -> download specific revision of the data. +> download specific [revision](https://git-scm.com/docs/revisions) of the data. 
### Import workflow diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index ea10adb3c4..3799677253 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -67,8 +67,9 @@ A dependency entry consists of a pair of fields: - `url`: URL of Git repository with source DVC project - `rev`: Only present when the `--rev` option of `dvc import` is used. - Specific [Git revision](https://git-scm.com/docs/revisions) (such as a - branch name or a tag) used to import the dependency from. + Specific commit SHA hash, branch or tag name, etc. (a + [Git revision](https://git-scm.com/docs/revisions)) used to import the + dependency from. - `rev_lock`: Git commit SHA hash of the external DVC repository at the time of importing or updating (with `dvc update`) the dependency. From e76329a75b3e33d3a41c3b45375b914924823f46 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 23:51:43 -0600 Subject: [PATCH 16/27] cmd ref: friendlier explanation of "tip of default branch" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350540726 --- public/static/docs/command-reference/get.md | 4 ++-- public/static/docs/command-reference/import.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 6e399efd90..f042dbe40f 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -60,8 +60,8 @@ name. - `--rev` - specific commit SHA hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to - download the file or directory from. The tip of the default branch is used by - default when this option is not specified. + download the file or directory from. 
The latest commit in `master` (tip of the + default branch) is used by default when this option is not specified. - `--show-url` - instead of downloading the file or directory, just print the storage location (URL) of the target data. `path` is expected to represent a diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index ec59fbebe1..4a419ef74c 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -77,8 +77,8 @@ data artifact from the source project. - `--rev` - specific commit SHA hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to - download the file or directory from. The tip of the default branch is used by - default when this option is not specified. + download the file or directory from. The latest commit in `master` (tip of the + default branch) is used by default when this option is not specified. > Note that this adds a `rev` field in the import stage that fixes it to this > revision. This can impact the behavior of `dvc update`. 
(See From d02ccd2a752984f1dd53c52393c033e3d7bf4385 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 29 Jan 2020 23:56:58 -0600 Subject: [PATCH 17/27] cmd ref: use tag name instead of term "the revision" per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350541159 --- public/static/docs/command-reference/diff.md | 7 ++++--- public/static/docs/command-reference/import-url.md | 12 +++++------- 2 files changed, 9 insertions(+), 10 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index 9ef171384c..f83f7df8eb 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -59,9 +59,10 @@ For these examples we can use the chapters in our ### Click and expand to setup example Start by cloning our example repo if you don't already have it. Then move into -the repo and checkout -[the revision](https://github.com/iterative/example-get-started/releases/tag/3-add-file) -corresponding to the [Add Files](/doc/get-started/add-files) chapter: +the repo and checkout the +[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file) +tag, corresponding to the [Add Files](/doc/get-started/add-files) _Get Started_ +chapter: ```dvc $ git clone https://github.com/iterative/example-get-started diff --git a/public/static/docs/command-reference/import-url.md b/public/static/docs/command-reference/import-url.md index 7f3a1d432b..68c9e02029 100644 --- a/public/static/docs/command-reference/import-url.md +++ b/public/static/docs/command-reference/import-url.md @@ -129,15 +129,13 @@ in the [Get Started](/doc/get-started) section.
-### Click and expand to setup the example project - -Follow these instructions before each example below if you actually want to try -them on your system. +### Click and expand to setup example Start by cloning our example repo if you don't already have it. Then move into -the repo and checkout -[the revision](https://github.com/iterative/example-get-started/releases/tag/2-remote) -corresponding to the [Configure](/doc/get-started/configure) chapter: +the repo and checkout the +[2-remote](https://github.com/iterative/example-get-started/releases/tag/2-remote) +tag, corresponding to the [Configure](/doc/get-started/configure) _Get Started_ +chapter: ```dvc $ git clone https://github.com/iterative/example-get-started From 14d4c235be4e75cf9af3e4bddd0e8a802365b483 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 00:42:14 -0600 Subject: [PATCH 18/27] term: revert some "revision"->"reference" changes, and related simplifications per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350541159 --- public/static/docs/command-reference/diff.md | 13 ++++--- public/static/docs/command-reference/get.md | 11 +++--- .../static/docs/command-reference/import.md | 2 +- .../static/docs/command-reference/install.md | 34 +++++++++---------- .../docs/command-reference/metrics/diff.md | 4 +-- .../docs/command-reference/metrics/index.md | 1 + .../docs/command-reference/metrics/show.md | 2 +- public/static/docs/command-reference/pull.md | 4 +-- public/static/docs/command-reference/push.md | 6 ++-- .../static/docs/command-reference/status.md | 6 ++-- public/static/docs/tutorials/versioning.md | 8 ++--- .../static/docs/use-cases/data-registries.md | 6 ++-- .../versioning-data-and-model-files.md | 9 ++--- 13 files changed, 53 insertions(+), 53 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index f83f7df8eb..cb311165dc 100644 --- a/public/static/docs/command-reference/diff.md +++ 
b/public/static/docs/command-reference/diff.md @@ -1,8 +1,7 @@ # diff -Show changes between DVC repository -[revisions](https://git-scm.com/docs/revisions). It can be narrowed down to -specific target files and directories under DVC control. +Show changes between versions of the DVC repository. It can be +narrowed down to specific target files and directories under DVC control. > This command requires that the project is a > [Git](https://git-scm.com/) repository. @@ -84,7 +83,7 @@ Preparing to download data from 'https://remote.dvc.org/get-started' The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from which to calculate the difference. The "until" reference (`b_ref`) defaults to -`HEAD` (current Git revision). +`HEAD` (current [Git revision](https://git-scm.com/docs/revisions)). To see the difference with the very previous revision of the project, we can use `HEAD^` as `a_ref`: @@ -192,6 +191,6 @@ diff for 'data/prepared' ``` The command above checks whether there have been any changes to the -`data/prepared` directory after the `5-preparation` revision (since the `b_ref` -is `HEAD` by default). The output tells us that there have been no changes to -that directory (or to any other file). +`data/prepared` directory after the `5-preparation` tag (since the `b_ref` is +`HEAD` by default). The output tells us that there have been no changes to that +directory (or to any other file). diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index f042dbe40f..16e3243086 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -177,12 +177,11 @@ $ dvc get . model.pkl --rev 7-train --out model.monograms.pkl The `model.monograms.pkl` file now contains the older version of the model. 
To get the most recent one, we use a similar command, but with - -`-o model.bigrams.pkl` and `--rev 9-bigrams-model` or even without `--rev` -(since it's the latest revision anyway). In fact, in this case using `dvc pull` -with the corresponding [DVC-files](/doc/user-guide/dvc-file-format) should -suffice, downloading the file as just `model.pkl`. We can then rename it to make -its variant explicit: +`-o model.bigrams.pkl` and `--rev 9-bigrams-model` (or even without `--rev` +since tag `9-bigrams-model` has the latest model version anyway). In fact, in +this case using `dvc pull` with the corresponding +[DVC-files](/doc/user-guide/dvc-file-format) should suffice, downloading the +file as just `model.pkl`. We can then rename it to make its variant explicit: ```dvc $ dvc pull train.dvc diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index 4a419ef74c..a66cc5a98d 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -159,7 +159,7 @@ deps: ``` If the -[Git-reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) moves (e.g. a branch), you may use `dvc update` to bring the data up to date. However, for typically static references (e.g. tags), or for commit SHA hashes, in order to actually "update" an import, it's necessary to **re-import the diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index 8e0311ead6..90f0cac573 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -20,25 +20,24 @@ Namely: **Checkout**: For any commit SHA hash, branch or tag name, etc. `git checkout` retrieves the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that -[Git revision](https://git-scm.com/docs/revisions). 
The project's +[revision](https://git-scm.com/docs/revisions). The project's DVC-files in turn refer to data stored in cache, but not necessarily in the workspace. Normally, it would be necessary to run `dvc checkout` to synchronize workspace and DVC-files. The installed Git hook automates running `dvc checkout`. -**Commit**: When committing a change to the Git repository, that change possibly -requires reproducing the corresponding -[pipeline](/doc/command-reference/pipeline) (using `dvc repro`) to regenerate -the project results. Or there might be new data files not yet in cache, which -requires running `dvc commit` to store them. +**Commit**: When committing a change with Git, that change possibly requires +reproducing the corresponding [pipeline](/doc/command-reference/pipeline) (using +`dvc repro`) to regenerate the project results. Or there might be new data files +not yet in cache, which requires running `dvc commit` to store them. The installed Git hook automates reminding the user to run either `dvc repro` or `dvc commit`, as needed. -**Push**: While publishing changes to the Git remote repository with `git push`, -it easy to forget that the `dvc push` command is necessary to upload new or -updated data files and directories under DVC control to +**Push**: While publishing changes to the Git remote with `git push`, it is easy to +forget that the `dvc push` command is necessary to upload new or updated data +files and directories under DVC control to [remote storage](/doc/command-reference/remote). The installed Git hook automates executing `dvc push`. @@ -121,11 +120,10 @@ $ dvc pull --all-branches --all-tags ## Example: Checkout both DVC and Git Let's start our exploration with the impact of `dvc install` on the -`dvc checkout` command. Remember that switching from one Git revision to another -(with `git checkout`) changes the set of -[DVC-files](/doc/user-guide/dvc-file-format) in the project. 
This changes the -set of data files that should be located in the workspace (which can be achieved -with `dvc checkout`). +`dvc checkout` command. Switching among repository versions (with +`git checkout`) changes the set of [DVC-files](/doc/user-guide/dvc-file-format) +in the project. This changes the set of data files that should be located in the +workspace (which can be achieved with `dvc checkout`). Let's first list the available tags in the _Get Started_ project: @@ -276,7 +274,7 @@ Data and pipelines are up to date. ``` After reproducing this pipeline up to the "evaluate" stage, the data files are -in sync with the code/config files, but we must now commit the changes to the -Git repository. Looking closely we see that `dvc status` is used again, -informing us that the data files are synchronized with the -`Data and pipelines are up to date.` message. +in sync with the code/config files, but we must now commit the changes with Git. +Looking closely we see that `dvc status` is used again, informing us that the +data files are synchronized with the `Data and pipelines are up to date.` +message. diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 06e27b41b9..e1f8a4021b 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -1,8 +1,8 @@ # metrics diff Show a table of changes between -[metrics](/doc/command-reference/metrics#description) among DVC -repository [revisions](https://git-scm.com/docs/revisions). +[metrics](/doc/command-reference/metrics#description) among versions of the +DVC repository. > This command requires that the project is a > [Git](https://git-scm.com/) repository. 
diff --git a/public/static/docs/command-reference/metrics/index.md b/public/static/docs/command-reference/metrics/index.md index ebdd850361..e6b16f99b3 100644 --- a/public/static/docs/command-reference/metrics/index.md +++ b/public/static/docs/command-reference/metrics/index.md @@ -31,6 +31,7 @@ the best performing experiment. [Add](/doc/command-reference/metrics/add), [show](/doc/command-reference/metrics/show), +[diff](/doc/command-reference/metrics/diff), [modify](/doc/command-reference/metrics/modify), and [remove](/doc/command-reference/metrics/remove) commands are available to set up and manage DVC project metrics. diff --git a/public/static/docs/command-reference/metrics/show.md b/public/static/docs/command-reference/metrics/show.md index e29d6c0e22..d69ba7fb82 100644 --- a/public/static/docs/command-reference/metrics/show.md +++ b/public/static/docs/command-reference/metrics/show.md @@ -39,7 +39,7 @@ extension.) > See `dvc metrics modify` to learn how to apply `-t` and `-x` permanently. See also `dvc metrics diff` to show changes in metrics between different -repository [revisions](https://git-scm.com/docs/revisions). +versions of the repository. ## Options diff --git a/public/static/docs/command-reference/pull.md b/public/static/docs/command-reference/pull.md index b6cbc2ed72..29f0ca1290 100644 --- a/public/static/docs/command-reference/pull.md +++ b/public/static/docs/command-reference/pull.md @@ -40,8 +40,8 @@ With no arguments, just `dvc pull` or `dvc pull --remote REMOTE`, it downloads only the files (or directories) missing from the workspace by searching all [DVC-files](/doc/user-guide/dvc-file-format) currently in the project. It will not download files associated with earlier -[revisions](https://git-scm.com/docs/revisions) of the repository -(if using Git), nor will it download files that have not changed. +versions of the repository (if using Git), nor will it download +files that have not changed. 
The command `dvc status -c` can list files referenced in current DVC-files, but missing in the cache. It can be used to see what files `dvc pull` diff --git a/public/static/docs/command-reference/push.md b/public/static/docs/command-reference/push.md index 9dc17c7b30..bd2842d7b1 100644 --- a/public/static/docs/command-reference/push.md +++ b/public/static/docs/command-reference/push.md @@ -52,9 +52,9 @@ configure a remote. With no arguments, just `dvc push` or `dvc push --remote REMOTE`, it uploads only the files (or directories) that are new in the local repository to remote -storage. It will not upload files associated with earlier -[revisions](https://git-scm.com/docs/revisions) of the repository -(if using Git), nor will it upload files that have not changed. +storage. It will not upload files associated with earlier versions of the +repository (if using Git), nor will it upload files that have not +changed. The `dvc status -c` command can list files tracked by DVC that are new in the cache (compared to the default remote.) It can be used to see what files diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 922d017fe0..8094cebbe4 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -33,7 +33,7 @@ _cloud_ are triggered by using the `--cloud` or `--remote` options: DVC determines data and code files to compare by analyzing all [DVC-files](/doc/user-guide/dvc-file-format) in the repository -(`--all-branches` and `--all-tags` in the `cloud` mode compare multiple +(`--all-branches` and `--all-tags` in `--cloud` mode compare to multiple [Git revisions](https://git-scm.com/docs/revisions)). The comparison can be limited to specific DVC-files by listing them as `targets`. Changes are reported only against the given `targets`. When combined with the `--with-deps` option, a @@ -113,13 +113,13 @@ workspace) is different from remote storage. 
Bringing the two into sync requires name defined using the `dvc remote` command. Implies `--cloud`. - `-a`, `--all-branches` - compares cache content against all Git branches - instead of checking just the current revision. This basically runs the same + instead of checking just the current workspace. This basically runs the same status command in all the branches of this repo. The corresponding branches are shown in the status output. Applies only if `--cloud` or a `-r` remote is specified. - `-T`, `--all-tags` - compares cache content against all Git tags instead of - checking just the current revision. Similar to `-a` above. Note that both + checking just the current workspace. Similar to `-a` above. Note that both options can be combined, for example using the `-aT` flag. - `-j JOBS`, `--jobs JOBS` - specifies the number of jobs DVC can use to diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index a8f04750ba..7c69f223e4 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -268,8 +268,8 @@ data files, model, all of it. DVC optimizes this operation to avoid copying data or model files each time. So `dvc checkout` is quick even if you have large datasets, data files, or models. -On the other hand, if we want to keep the current revision of the code, but go -back to the previous dataset version, we can do something like this: +On the other hand, if we want to keep the current code, but go back to the +previous dataset version, we can do something like this: ```dvc $ git checkout v1.0 data.dvc @@ -277,8 +277,8 @@ $ dvc checkout data.dvc ``` If you run `git status` you'll see that `data.dvc` is modified and currently -points to the `v1.0` of the dataset, while code and model files are from the -`v2.0` [revision](https://git-scm.com/docs/revisions). +points to the `v1.0` version of the dataset, while code and model files are from +the `v2.0` tag.
diff --git a/public/static/docs/use-cases/data-registries.md b/public/static/docs/use-cases/data-registries.md index 76631578d6..c8bbc5071f 100644 --- a/public/static/docs/use-cases/data-registries.md +++ b/public/static/docs/use-cases/data-registries.md @@ -109,7 +109,8 @@ This downloads `music/songs/` from the project's current working directory (anywhere in the file system with user write access). > Note that this command (as well as `dvc import`) has a `--rev` option to -> download specific [revision](https://git-scm.com/docs/revisions) of the data. +> download the data from a specific +> [revision](https://git-scm.com/docs/revisions) of the source project. ### Import workflow @@ -144,7 +145,8 @@ updates the project dependency metadata in the import stage (DVC-file). ### Programatic reusability of DVC data Our Python API, included with the `dvc` package installed with DVC, includes the -`open` function to load/stream data directly from external DVC projects: +`open` function to load/stream data directly from external DVC +projects: ```python import dvc.api.open diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index 5470a0b8c8..5b94f215ee 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -84,8 +84,9 @@ full workspace checkout, or checkout of a specific data or model file. Let's consider the full checkout first. It's quite straightforward: > `v1.0` below is a Git tag that should be created in advance to identify the -> dataset version you are interested in. Any revision (for example `HEAD^` or a -> commit SHA hash) can be used instead. +> dataset version you are interested in. Any +> [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) +> (for example `HEAD^` or a commit SHA hash) can be used instead. 
```dvc $ git checkout v1.0 @@ -108,8 +109,8 @@ $ dvc checkout data.dvc ``` If you run `git status` you will see that `data.dvc` is modified and currently -points to the `v1.0` [revision](https://git-scm.com/docs/revisions) of the -repository. Meanwhile, code and model files are their latest versions. +points to the `v1.0` version of the cached data. Meanwhile, code +and model files are their latest versions. ![](/static/img/versioning.png) From e7e0b9750e5f817310680b671fdbfe939d0c8293 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 02:19:51 -0600 Subject: [PATCH 19/27] cmd ref: review desc. of `-a` options throughout refs --- public/static/docs/command-reference/fetch.md | 9 +++++---- public/static/docs/command-reference/metrics/show.md | 11 ++++++----- public/static/docs/command-reference/pull.md | 5 +++-- public/static/docs/command-reference/push.md | 4 ++-- public/static/docs/command-reference/status.md | 7 +++---- 5 files changed, 19 insertions(+), 17 deletions(-) diff --git a/public/static/docs/command-reference/fetch.md b/public/static/docs/command-reference/fetch.md index 8c8e06ade3..8f9f769e45 100644 --- a/public/static/docs/command-reference/fetch.md +++ b/public/static/docs/command-reference/fetch.md @@ -94,10 +94,11 @@ specified in DVC-files currently in the project are considered by `dvc fetch` fetched. The default value is `4 * cpu_count()`. For SSH remotes default is just 4. -- `-a`, `--all-branches` - fetch cache for all Git branches, not just the active - one. This means DVC may download files needed to reproduce different versions - of a DVC-file ([experiments](/doc/get-started/experiments)), not just the - current one. +- `-a`, `--all-branches` - fetch cache for all Git branches instead of just the + current workspace. This means DVC may download files needed to reproduce + different versions of a DVC-file + ([experiments](/doc/get-started/experiments)), not just the ones currently in + the workspace. 
- `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note that both options can be combined, for example using the `-aT` flag. diff --git a/public/static/docs/command-reference/metrics/show.md b/public/static/docs/command-reference/metrics/show.md index d69ba7fb82..ececf1b2e9 100644 --- a/public/static/docs/command-reference/metrics/show.md +++ b/public/static/docs/command-reference/metrics/show.md @@ -80,12 +80,13 @@ versions of the repository. overwrite it for the current command run only – It may fail to produce any results or parse files that are not in a corresponding format in this case. -- `-a`, `--all-branches` - get and print metric file contents across all Git - branches. It can be used to compare different experiments. +- `-a`, `--all-branches` - print metric file contents in all Git branches + instead of just those present in the current workspace. It can be used to + compare different experiments. -- `-T`, `--all-tags` - get and print metric file contents across all Git tags. - Similar to `-a` above. Note that both options can be combined, for example - using the `-aT` flag. +- `-T`, `--all-tags` - print metric file contents in all Git tags. Similar to + `-a` above. Note that both options can be combined, for example using the + `-aT` flag. - `-R`, `--recursive` - determines the metric files to show by searching each target directory and its subdirectories for DVC-files to inspect. `targets` is diff --git a/public/static/docs/command-reference/pull.md b/public/static/docs/command-reference/pull.md index 29f0ca1290..f923decad0 100644 --- a/public/static/docs/command-reference/pull.md +++ b/public/static/docs/command-reference/pull.md @@ -65,8 +65,9 @@ reflinks or hardlinks to put it in the workspace without copying. See (configured with the `core.config` config option) is used. - `-a`, `--all-branches` - determines the files to download by examining - DVC-files all Git branches of the repository. 
It's useful if branches are used - to track experiments or project checkpoints. + DVC-files in all Git branches instead of just those present in the current + workspace. It's useful if branches are used to track experiments or project + checkpoints. - `-T`, `--all-tags` - the same as `-a`, `--all-branches` but Git tags are used to save different experiments or project checkpoints. Note that both options diff --git a/public/static/docs/command-reference/push.md b/public/static/docs/command-reference/push.md index bd2842d7b1..4f33b5e99d 100644 --- a/public/static/docs/command-reference/push.md +++ b/public/static/docs/command-reference/push.md @@ -74,8 +74,8 @@ to push. (configured with the `core.config` config option) is used. - `-a`, `--all-branches` - determines the files to upload by examining DVC-files - in all Git branches of the project repository (if using Git). It's useful if - branches are used to track experiments or project checkpoints. + in all Git branches instead of just those present in the current workspace. + It's useful if branches are used to track experiments or project checkpoints. - `-T`, `--all-tags` - the same as `-a`, `--all-branches`, but Git tags are used to save different experiments or project checkpoints. Note that both options diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 8094cebbe4..2e332b20ae 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -113,10 +113,9 @@ workspace) is different from remote storage. Bringing the two into sync requires name defined using the `dvc remote` command. Implies `--cloud`. - `-a`, `--all-branches` - compares cache content against all Git branches - instead of checking just the current workspace. This basically runs the same - status command in all the branches of this repo. The corresponding branches - are shown in the status output. 
Applies only if `--cloud` or a `-r` remote is - specified. + instead of just the current workspace. This basically runs the same status + command in every branch of this repo. The corresponding branches are shown in + the status output. Applies only if `--cloud` or a `-r` remote is specified. - `-T`, `--all-tags` - compares cache content against all Git tags instead of checking just the current workspace. Similar to `-a` above. Note that both From c5dbb96d0859122d69f71a639f4427e6fd81f48a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 12:28:53 -0600 Subject: [PATCH 20/27] cmd ref: update diff params per iterative/dvc/pull/3244 --- public/static/docs/command-reference/diff.md | 7 ++++--- public/static/docs/command-reference/metrics/diff.md | 8 ++++---- 2 files changed, 8 insertions(+), 7 deletions(-) diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index cb311165dc..77b11e3047 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -12,9 +12,10 @@ narrowed down to specific target files and directories under DVC control. usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref] positional arguments: - a_ref Git reference from which the diff begins - b_ref Git reference until which the diff ends. If omitted, - `HEAD` (latest commit) is used. 
+ a_ref Git reference to the older version to compare + (defaults to `HEAD`) + b_ref Git reference to the newer version to compare + (defaults to the current workspace including changes) ``` ## Description diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index e1f8a4021b..64ec6302c9 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -16,10 +16,10 @@ usage: dvc metrics diff [-h] [-q | -v] [a_ref] [b_ref] positional arguments: - a_ref Git reference from which the diff begins. If omitted, - `HEAD` (latest commit) is used. - b_ref Git reference until which the diff ends. If omitted, - the current workspace is used instead. + a_ref Git reference to the older version to compare + (defaults to `HEAD`) + b_ref Git reference to the newer version to compare + (defaults to the current workspace including changes) ``` ## Description From b30df290325047a84e51c03ffb1c951ba0adb99a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 12:48:55 -0600 Subject: [PATCH 21/27] cmd ref: update notes around moving/static Git refs in import and update per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350601759 --- public/static/docs/command-reference/import.md | 18 +++++++++--------- public/static/docs/command-reference/update.md | 6 +++--- 2 files changed, 12 insertions(+), 12 deletions(-) diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index a66cc5a98d..df17f64a77 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -158,15 +158,15 @@ deps: rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c ``` -If the -[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -moves (e.g. a branch), you may use `dvc update` to bring the data up to date. 
-However, for typically static references (e.g. tags), or for commit SHA hashes, -in order to actually "update" an import, it's necessary to **re-import the -data** instead, by using `dvc import` again without or with a different `--rev`. -This will overwrite the import stage (DVC-file), either removing or replacing -the `rev` field, respectively. This can produce an import stage that is able to -be updated normally with `dvc update` going forward. For example: +If the `rev` +[reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) moves +(e.g. a branch), you may use `dvc update` to bring the data up to date. However, +for typically static references (e.g. tags), or for commit hashes, in order to +actually "update" an import, it's necessary to **re-import the data** instead, +by using `dvc import` again without or with a different `--rev`. This will +overwrite the import stage (DVC-file), either removing or replacing the `rev` +field, respectively. This can produce an import stage that is able to be updated +normally with `dvc update` going forward. For example: ```dvc $ dvc import --rev master \ diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index 78be72ad65..b3652e0dd2 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -30,9 +30,9 @@ update them. Another detail to note is that when the `--rev` ([revision](https://git-scm.com/docs/revisions)) option of `dvc import` has been used to create an import stage, DVC is not aware of what kind of -[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -this is, for example a branch or a tag. For typically static references (e.g. -tags), or for commit SHA hashes, `dvc update` will not have any effect on the +[Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) has +been provided, for example a branch or a tag. 
For typically static references +(e.g. tags), or for commit hashes, `dvc update` will not have any effect on the import. Refer to the [re-importing example](/doc/command-reference/import#example-fixed-revisions-re-importing) to learn how to "update" fixed-revision imports. From 1e9f3aee63140343c31e59002d03fe9535029cb6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 13:08:26 -0600 Subject: [PATCH 22/27] revert workspace glossary entry per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350604641 --- public/static/docs/glossary.js | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/public/static/docs/glossary.js b/public/static/docs/glossary.js index 48bc08b5ef..40d252c180 100644 --- a/public/static/docs/glossary.js +++ b/public/static/docs/glossary.js @@ -8,12 +8,13 @@ export default { name: 'Workspace', match: ['workspace'], desc: ` -Collection of all your project files e.g. raw datasets, sourc code, ML models, -etc – typically in a single directory. -[external outputs](/doc/user-guide/managing-external-data) also form part of -your (expanded) workspace. This includes the -[working tree](https://git-scm.com/docs/gitglossary#def_working_tree) (\`HEAD\` -+ local changes) when using Git. +Directory containing all your project files. For example raw datasets, source +code, ML models, etc. A workspace becomes a **DVC project** when +[\`dvc init\`](/doc/command-reference/init) is run, and +[DVC-files](/doc/user-guide/dvc-file-format) or stage files are created in it. + +Note that [external outputs](/doc/user-guide/managing-external-data) also +form part of your expanded workspace, technically. 
` }, { From af6fc6391782dfa046f3222bedc8cdf98aee51ff Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 13:16:32 -0600 Subject: [PATCH 23/27] tutorial: use full name of Deep Dive Tutorial in title and links per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350605629 --- public/static/docs/get-started/pipeline.md | 4 ++-- public/static/docs/tutorials/deep/index.md | 2 +- public/static/docs/tutorials/pipelines.md | 3 ++- 3 files changed, 5 insertions(+), 4 deletions(-) diff --git a/public/static/docs/get-started/pipeline.md b/public/static/docs/get-started/pipeline.md index b8d3da245c..309f7e46b7 100644 --- a/public/static/docs/get-started/pipeline.md +++ b/public/static/docs/get-started/pipeline.md @@ -36,8 +36,8 @@ $ dvc push ``` This example is simplified just to show you a basic pipeline, see a more -advanced [example](/doc/tutorials/pipelines) or complete -[tutorial](/doc/tutorials/deep) to create a +advanced [example](/doc/tutorials/pipelines) or +[complete tutorial](/doc/tutorials/pipelines) to create a [NLP](https://en.wikipedia.org/wiki/Natural_language_processing) pipeline end-to-end. diff --git a/public/static/docs/tutorials/deep/index.md b/public/static/docs/tutorials/deep/index.md index eb80b091f6..2f2e22d222 100644 --- a/public/static/docs/tutorials/deep/index.md +++ b/public/static/docs/tutorials/deep/index.md @@ -1,4 +1,4 @@ -# Tutorial +# Deep Dive Tutorial This tutorial shows you how to solve a text classification problem using the DVC tool. diff --git a/public/static/docs/tutorials/pipelines.md b/public/static/docs/tutorials/pipelines.md index 1038629952..89957e791b 100644 --- a/public/static/docs/tutorials/pipelines.md +++ b/public/static/docs/tutorials/pipelines.md @@ -5,7 +5,8 @@ Let's explore the natural language processing ([NLP](https://en.wikipedia.org/wiki/Natural_language_processing)) problem of predicting tags for a given StackOverflow question. 
For example, we want a classifier that can predict posts about the Python language by tagging them -`python`. (This is a short version of the [Deep Tutorial](/doc/tutorials/deep).) +`python`. (This is a short version of the +[Deep Dive Tutorial](/doc/tutorials/deep).) In this example, we will focus on building a simple ML [pipeline](/doc/command-reference/pipeline) that takes an archive with From bd0c9bd141765a9fc36463ddb6383c4e189d4e7b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 13:18:29 -0600 Subject: [PATCH 24/27] user-guide: undo change on "binary" literal for analytics example per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350607861 --- public/static/docs/user-guide/analytics.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/public/static/docs/user-guide/analytics.md b/public/static/docs/user-guide/analytics.md index 2519f19f11..2fda16f988 100644 --- a/public/static/docs/user-guide/analytics.md +++ b/public/static/docs/user-guide/analytics.md @@ -29,7 +29,7 @@ DVC's analytics record the following information per event: - The operating system information, e.g. `linux`, `ubuntu`, `14.04`, etc. - Command type, e.g. `CmdDataPull` - Command return code, e.g. `1` -- Way the DVC was installed, e.g. `Binary` +- Way the DVC was installed, e.g. `binary` - A DVC analytics user ID (e.g. 
`8ca59a29-ddd9-4247-992a-9b4775732aad`), generated by [`uuid`](https://docs.python.org/3/library/uuid.html) From a0c51ff124b490804c19a1b8f1f82a63b57a7513 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 13:24:49 -0600 Subject: [PATCH 25/27] use-cases: avoid term "revision" in data-registries per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350608591 --- public/static/docs/use-cases/data-registries.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/public/static/docs/use-cases/data-registries.md b/public/static/docs/use-cases/data-registries.md index c8bbc5071f..010533d1b7 100644 --- a/public/static/docs/use-cases/data-registries.md +++ b/public/static/docs/use-cases/data-registries.md @@ -109,8 +109,8 @@ This downloads `music/songs/` from the project's current working directory (anywhere in the file system with user write access). > Note that this command (as well as `dvc import`) has a `--rev` option to -> download the data from a specific -> [revision](https://git-scm.com/docs/revisions) of the source project. +> download the data from a specific [commit](https://git-scm.com/docs/revisions) +> of the source repository. ### Import workflow @@ -138,8 +138,7 @@ $ dvc update dataset.dvc ``` `dvc update` downloads new and changed files, or removes deleted ones, from -`images/faces/`, based on the latest -[revision](https://git-scm.com/docs/revisions) of the source project. It also +`images/faces/`, based on the latest version of the source project. It also updates the project dependency metadata in the import stage (DVC-file). 
### Programatic reusability of DVC data From 67e025ef8ba8cbb74ed136e316bed32eb95455e1 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 30 Jan 2020 16:06:09 -0600 Subject: [PATCH 26/27] term: avoid "checksum" in favor of file "hash" value per https://github.com/iterative/dvc.org/pull/933#pullrequestreview-350538870 --- public/static/docs/changelog/0.18.md | 2 +- public/static/docs/changelog/0.35.md | 6 ++-- public/static/docs/command-reference/add.md | 16 ++++----- .../static/docs/command-reference/checkout.md | 6 ++-- .../static/docs/command-reference/commit.md | 16 ++++----- .../static/docs/command-reference/config.md | 14 ++++---- public/static/docs/command-reference/fetch.md | 19 +++++------ public/static/docs/command-reference/gc.md | 2 +- public/static/docs/command-reference/get.md | 2 +- .../docs/command-reference/import-url.md | 8 ++--- .../static/docs/command-reference/install.md | 10 +++--- .../docs/command-reference/remote/add.md | 5 ++- .../docs/command-reference/remote/modify.md | 9 +++-- public/static/docs/command-reference/repro.md | 4 +-- public/static/docs/command-reference/run.md | 6 ++-- .../static/docs/command-reference/status.md | 33 +++++++++---------- public/static/docs/get-started/add-files.md | 6 ++-- public/static/docs/get-started/store-data.md | 2 +- .../docs/tutorials/deep/define-ml-pipeline.md | 8 ++--- .../docs/tutorials/deep/reproducibility.md | 14 ++++---- public/static/docs/tutorials/pipelines.md | 6 ++-- .../understanding-dvc/related-technologies.md | 6 ++-- .../static/docs/user-guide/dvc-file-format.md | 7 ++-- .../user-guide/dvc-files-and-directories.md | 30 ++++++++--------- .../docs/user-guide/managing-external-data.md | 5 ++- 25 files changed, 118 insertions(+), 124 deletions(-) diff --git a/public/static/docs/changelog/0.18.md b/public/static/docs/changelog/0.18.md index 945f644a56..92021df8a4 100644 --- a/public/static/docs/changelog/0.18.md +++ b/public/static/docs/changelog/0.18.md @@ -28,7 +28,7 @@ really excited to 
share the progress with you: - 🙂 **Usability improvements** - DVC interface got more informative and easier to use: - - More heavy operations render dynamic progress bar (e.g. checksum + - More heavy operations render dynamic progress bar (e.g. file hash computation): ![](/static/img/0.18-progress.gif) - Pipeline visualization via command line. Just run `dvc pipeline show` with diff --git a/public/static/docs/changelog/0.35.md b/public/static/docs/changelog/0.35.md index 20d7b69e80..f229851be2 100644 --- a/public/static/docs/changelog/0.35.md +++ b/public/static/docs/changelog/0.35.md @@ -59,9 +59,9 @@ improvements) we have done in the last few months: - ⚡️ **Performance optimizations.** The most notable one is the migration from using a plain JSON file to an (embedded) SQLLite instance, to cache file and - directory checksums. Another one is improved performance, stability and - general user experience for the commands that navigate tags or branches (all - the commands that include `--all-bracnhes`, `-a` or `--all-tags`, `-T`). + directory hashes. Another one is improved performance, stability and general + user experience for the commands that navigate tags or branches (all the + commands that include `--all-branches`, `-a` or `--all-tags`, `-T`). There are new [integrations and plugins](/doc/install/plugins) available: diff --git a/public/static/docs/command-reference/add.md b/public/static/docs/command-reference/add.md index c74b6f421e..73a6d55c61 100644 --- a/public/static/docs/command-reference/add.md +++ b/public/static/docs/command-reference/add.md @@ -29,16 +29,16 @@ that becomes [external outputs](/doc/user-guide/managing-external-data). Under the hood, a few actions are taken for each file (or directory) in `targets`: -1. Calculate the file checksum. +1. Calculate the file hash. 2. Move the file contents to the cache directory (by default in `.dvc/cache`), - using the checksum to form the cached file names. (See
(See + using the file hash to form the cached file names. (See [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) for more details.) 3. Attempt to replace the file by a link to the file in cache (more details below). -4. Create a corresponding DVC-file and store the checksum to identify the cached - file. Unless the `-f` option is used, the DVC-file name generated by default - is `.dvc`, where `` is the file name of the first target. +4. Create a corresponding DVC-file and store the file hash to identify the + cached file. Unless the `-f` option is used, the DVC-file name generated by + default is `.dvc`, where `` is the file name of the first target. 5. Unless `dvc init --no-scm` was used when initializing the project, add the `targets` to `.gitignore` in order to prevent them from being committed to the Git repository. @@ -48,7 +48,7 @@ Under the hood, a few actions are taken for each file (or directory) in The result is that the target data gets cached by DVC, and instead small DVC-files can be tracked with Git. The DVC-file lists the added file as an -output (`outs` field), and references the cached file using the checksum. See +output (`outs` field), and references the cached file using its hash. See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details. > Note that DVC-files created by this command are considered _orphans_ because @@ -150,7 +150,7 @@ meta: # Special field to contain arbitary user data email: john@xyz.com ``` -This is a standard DVC-file with only an `outs` entry. The checksum should +This is a standard DVC-file with only an `outs` entry. The hash value should correspond to an entry in the cache. > Note that the `meta` values above were entered manually for this example. Meta @@ -197,7 +197,7 @@ Saving information to 'pics.dvc'. There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this directory structure, but the images are all added to the cache. 
DVC -prints a message about this, mentioning that MD5 checksums are computed for each +prints a message about this, mentioning that MD5 hashes are computed for each directory. A single `pics.dvc` DVC-file is generated for the top-level directory, and it contains: diff --git a/public/static/docs/command-reference/checkout.md b/public/static/docs/command-reference/checkout.md index 4853ad4e12..1585a01531 100644 --- a/public/static/docs/command-reference/checkout.md +++ b/public/static/docs/command-reference/checkout.md @@ -31,7 +31,7 @@ The execution of `dvc checkout` does the following: - Scans the DVC-files to compare against the data files or directories in the workspace. DVC knows which data (outputs) match - because their checksums are saved in the `outs` fields inside the DVC-files. + because their hash values are saved in the `outs` fields inside the DVC-files. Scanning is limited to the given `targets` (if any). See also options `--with-deps` and `--recursive` below. @@ -134,7 +134,7 @@ The workspace looks like this: This project comes with a predefined HTTP [remote storage](/doc/command-reference/remote). We can now just run `dvc pull` that will fetch and checkout the most recent `model.pkl`, `data.xml`, and other -files that are under DVC control. The model file checksum +files that are under DVC control. 
The model file hash `3863d0e317dee0a55c4e59d2ec0eef33` will be used in the `train.dvc` [stage file](/doc/command-reference/run): @@ -168,7 +168,7 @@ outs: path: model.pkl ``` -But if you check `model.pkl`, the file checksum is still the same: +But if you check `model.pkl`, the file hash is still the same: ```dvc $ md5 model.pkl diff --git a/public/static/docs/command-reference/commit.md b/public/static/docs/command-reference/commit.md index 2535d40e43..74c7bedd5c 100644 --- a/public/static/docs/command-reference/commit.md +++ b/public/static/docs/command-reference/commit.md @@ -46,8 +46,8 @@ Let's take a look at what is happening in the fist scenario closely. Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the cache after creating a DVC-file. What _commit_ means is that DVC: -- Computes a checksum for the file/directory. -- Enters the checksum and file name into the DVC-file. +- Computes a hash for the file/directory. +- Enters the hash value and file name into the DVC-file. - Tells Git to ignore the file/directory (adding an entry to `.gitignore`). (Note that if the project was initialized with no SCM support (`dvc init --no-scm`), this does not happen.) @@ -56,10 +56,10 @@ DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the There are many cases where the last step is not desirable (for example rapid iterations on an experiment). The `--no-commit` option prevents the last step from occurring (on the commands where it's available), saving time and space by -not storing unwanted data artifacts. Checksums is still computed -and added to the DVC-file, but the actual data file is not saved in the cache. -This is where the `dvc commit` command comes into play. It performs that last -step (saving the data in cache). +not storing unwanted data artifacts. The file hash is still +computed and added to the DVC-file, but the actual data file is not saved in the +cache. 
This is where the `dvc commit` command comes into play. It performs that +last step (saving the data in cache). Note that it's best to avoid the last two scenarios. They essentially force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to @@ -78,7 +78,7 @@ reproducibility in those cases. for this option to have effect. Determines the files to commit by searching each target directory and its subdirectories for DVC-files to inspect. -- `-f`, `--force` - commit data even if checksums for dependencies or outputs +- `-f`, `--force` - commit data even if hash values for dependencies or outputs did not change. - `-h`, `--help` - prints the usage/help message, and exit. @@ -193,7 +193,7 @@ wdir: . To verify this instance of `model.pkl` is not in the cache, we must know the path to the cached file. In the cache directory, the first two characters of the -checksum are used as a subdirectory name, and the remaining characters are the +hash value are used as a subdirectory name, and the remaining characters are the file name. Therefore, had the file been committed to the cache, it would appear in the directory `.dvc/cache/70`. Let's check: diff --git a/public/static/docs/command-reference/config.md b/public/static/docs/command-reference/config.md index 29f3d3b47d..e7f76c7195 100644 --- a/public/static/docs/command-reference/config.md +++ b/public/static/docs/command-reference/config.md @@ -76,7 +76,7 @@ This is the main section with the general config options: [anonymized usage statistics](/doc/user-guide/analytics). Accepts values `true` (default) and `false`. -- `core.checksum_jobs` - number of threads for computing checksums. Accepts +- `core.checksum_jobs` - number of threads for computing file hashes. Accepts positive integers. The default value is `max(1, min(4, cpu_count() // 2))`. - `core.hardlink_lock` - use hardlink file locks instead of the default ones, @@ -161,9 +161,9 @@ for more details.) 
This section contains the following options: > Avoid using the same remote location that you are using for `dvc push`, > `dvc pull`, `dvc fetch` as external cache for your external outputs, because - > it may cause possible checksum overlaps. Checksum for some data file on an - > external storage can potentially collide with checksum generated locally for - > a different file, with a different content. + > it may cause file hash overlaps: the hash of a data file in + > external storage could collide with a hash generated locally for another + > file with different content. - `cache.s3` - name of an [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3). @@ -184,9 +184,9 @@ learn more about the state file (database) that is used for optimization. - `state.row_limit` - maximum number of entries in the state database, which affects the physical size of the state file itself, as well as the performance - of certain DVC operations. The bigger the limit the more checksum history DVC - can keep in order to avoid sequential checksum recalculations for the files. - Default limit is set to 10 000 000 rows. + of certain DVC operations. The bigger the limit, the longer the file hash + history that DVC can keep, in order to avoid sequential hash recalculations. + The default limit is set to 10,000,000 rows. - `state.row_cleanup_quota` - percentage of the state database that is going to be deleted when it hits the `state.row_limit`. When an entry in the database diff --git a/public/static/docs/command-reference/fetch.md b/public/static/docs/command-reference/fetch.md index 8f9f769e45..207fc61775 100644 --- a/public/static/docs/command-reference/fetch.md +++ b/public/static/docs/command-reference/fetch.md @@ -64,11 +64,10 @@ for more information on how to configure different remote storage providers.
`dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands perform data synchronization among local and remote storage. The specific way in which the set of files to push/fetch/pull is determined begins with calculating -the checksums of the files in question, when these are -[added](/doc/get-started/add-files) to DVC. File checksums are then stored in -the corresponding DVC-files (usually saved in a Git branch). Only the checksums -specified in DVC-files currently in the project are considered by `dvc fetch` -(unless the `-a` or `-T` options are used). +file hashes when these are [added](/doc/get-started/add-files) to DVC. File +hashes are stored in the corresponding DVC-files (typically versioned with Git). +Only the hashes specified in DVC-files currently in the workspace are considered +by `dvc fetch` (unless the `-a` or `-T` options are used). ## Options @@ -103,7 +102,7 @@ specified in DVC-files currently in the project are considered by `dvc fetch` - `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note that both options can be combined, for example using the `-aT` flag. -- `--show-checksums` - show checksums instead of file names when printing the +- `--show-checksums` - show file hashes instead of file names when printing the download progress. * `-h`, `--help` - prints the usage/help message, and exit. @@ -190,8 +189,8 @@ $ tree .dvc > remote. As seen above, used without arguments, `dvc fetch` downloads all assets needed -by all DVC-files in the current branch, including for directories. The checksums -`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` +by all DVC-files in the current branch, including for directories. The hash +values `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and `data/features/` directory, respectively. 
Let's now link files from the cache to the workspace with: @@ -228,8 +227,8 @@ $ tree .dvc/cache > Note that `prepare.dvc` is the first stage in our example's pipeline. Cache entries for the necessary directories, as well as the actual -`data/prepared/test.tsv` and `data/prepared/train.tsv` files were download, -checksums shown above. +`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded. +Their hash values are shown above. ## Example: With dependencies diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index c148a7d143..7cf7f080e4 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -78,7 +78,7 @@ $ du -sh .dvc/cache/ ``` When you run `dvc gc` it removes all objects from cache that are not referenced -in the workspace (by collecting hash sums from the DVC-files): +in the workspace (by collecting hash values from the DVC-files): ```dvc $ dvc gc diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 16e3243086..4618a6964f 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -136,7 +136,7 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 `remote.dvc.org/get-started` is an HTTP [DVC remote](/doc/command-reference/remote), whereas -`662eb7f64216d9c2c1088d0a5e2c6951` is the file's checksum. +`662eb7f64216d9c2c1088d0a5e2c6951` is the file hash. ## Example: Compare different versions of data or model diff --git a/public/static/docs/command-reference/import-url.md b/public/static/docs/command-reference/import-url.md index 68c9e02029..69198e910b 100644 --- a/public/static/docs/command-reference/import-url.md +++ b/public/static/docs/command-reference/import-url.md @@ -241,8 +241,8 @@ outs: The DVC-file is nearly the same as in the previous example. 
The difference is that the dependency (`deps`) now references the local file in the "datastore" directory we created previously. (Its `path` has the URL for the datastore.) And -instead of an `etag` we have an `md5` checksum. We did this so its easy to edit -the data file. +instead of an `etag`, we have an `md5` hash value. We did this so it's easy to +edit the data file. Let's now manually reproduce a [processing chapter](/doc/get-started/connect-code-and-data) from the _Get @@ -306,8 +306,8 @@ Data and pipelines are up to date. In the datastore directory, edit `data.xml`. It doesn't matter what you change, as long as it remains a valid XML file, because any change will result in a -different dependency file checksum (`md5`) in the import stage DVC-file. Once we -do so, we can run `dvc update` to make sure the import stage is up to date: +different dependency file hash (`md5`) in the import stage DVC-file. Once we do +so, we can run `dvc update` to make sure the import stage is up to date: ```dvc $ dvc update data.xml.dvc diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index 90f0cac573..be34f8b5b5 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -162,7 +162,6 @@ featurize.dvc: $ dvc checkout $ dvc status - Data and pipelines are up to date. ``` @@ -173,10 +172,11 @@ running `git checkout master`. We also see that the first `dvc status` tells us about differences between the project cache and the data files currently in the workspace. Git changed the DVC-files in the workspace, which changed references to data files. -What `dvc status` did is inform us the data files in the workspace no longer -matched the checksums in the [DVC-files](/doc/user-guide/dvc-file-format). -Running `dvc checkout` then checks out the corresponding data files, and a -second `dvc status` now tells us the data files match the DVC-files.
+`dvc status` first informed us that the data files in the workspace no longer +matched the hash values in the corresponding +[DVC-files](/doc/user-guide/dvc-file-format). Running `dvc checkout` then brings +them up to date, and a second `dvc status` tells us that the data files now do +match the DVC-files. ```dvc $ git checkout master diff --git a/public/static/docs/command-reference/remote/add.md b/public/static/docs/command-reference/remote/add.md index 29337185b6..cdb7692f2d 100644 --- a/public/static/docs/command-reference/remote/add.md +++ b/public/static/docs/command-reference/remote/add.md @@ -242,9 +242,8 @@ $ dvc remote add myremote gdrive://0AIac4JZqHhKmUk9PDA/my-dvc-root Note that GDrive remotes are not "trusted" by default. This means that the [`verify`](/doc/command-reference/remote/modify#available-settings-for-all-remotes) -option is enabled on this type of storage, so DVC recalculates the checksums of -files upon download (e.g. `dvc pull`), to make sure that these haven't been -modified. +option is enabled on this type of storage, so DVC recalculates the file hashes +upon download (e.g. `dvc pull`), to make sure that these haven't been modified.
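The `verify` behavior described in the hunk above (recalculating a file hash after download and comparing it to the recorded value) can be sketched as follows. This is only an illustration, assuming the recorded value is a plain MD5 of the file's raw bytes; `verify_download` is a hypothetical helper, not part of DVC's API, and DVC's own hashing may differ in details.

```python
import hashlib


def verify_download(path, expected_md5, chunk_size=1024 * 1024):
    """Recompute the MD5 of a downloaded file and compare it to the expected hash.

    Hypothetical helper for illustration; assumes a plain MD5 over raw bytes.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read in chunks so that large data files are never loaded into memory at once.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

A `False` result would mean the downloaded file was modified or corrupted relative to the hash saved when it was tracked. The extra pass over every file is also why such a check can slow down downloads.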
diff --git a/public/static/docs/command-reference/remote/modify.md b/public/static/docs/command-reference/remote/modify.md index 2b9a1b1344..fd848f4c20 100644 --- a/public/static/docs/command-reference/remote/modify.md +++ b/public/static/docs/command-reference/remote/modify.md @@ -64,11 +64,10 @@ manual editing could be used to change the configuration. The following options are available for all remote types: - `verify` - upon downloading cache files (`dvc pull`, `dvc fetch`) - DVC will recalculate the checksums of files upon download (e.g. `dvc pull`) to - make sure that these haven't been modified, or corrupted during download. It - may slow down the aforementioned commands. The calculated checksum is compared - to the one saved in the corresponding - [DVC-file](/doc/user-guide/dvc-file-format). + DVC will recalculate the file hashes upon download (e.g. `dvc pull`) to make + sure that these haven't been modified, or corrupted during download. It may + slow down the aforementioned commands. The calculated hash is compared to the + value saved in the corresponding [DVC-file](/doc/user-guide/dvc-file-format). > Note that this option is enabled on **Google Drive** remotes by default. diff --git a/public/static/docs/command-reference/repro.md b/public/static/docs/command-reference/repro.md index ff690a63ce..ba3b282776 100644 --- a/public/static/docs/command-reference/repro.md +++ b/public/static/docs/command-reference/repro.md @@ -44,7 +44,7 @@ before running the stages that produce them. files, intermediate or final results. It saves all the data files, intermediate or final results into the DVC cache (unless `--no-commit` option is specified), and updates stage files with the new dependency/output file or -directory checksums. +directory hash values. ### Parallel stage execution @@ -240,7 +240,7 @@ Saving information to 'Dvcfile'. 
``` You can now check that `Dvcfile` and `count.txt` have been updated with the new -information and updated dependency/output file checksums, and a new result, +information and updated dependency/output file hash values, and a new result, respectively. ## Example: Downstream diff --git a/public/static/docs/command-reference/run.md b/public/static/docs/command-reference/run.md index 7e9edc5090..58bdbf47ee 100644 --- a/public/static/docs/command-reference/run.md +++ b/public/static/docs/command-reference/run.md @@ -132,9 +132,9 @@ data pipeline (e.g. random numbers, time functions, hardware dependency, etc.) - `--no-exec` - create a stage file, but do not execute the `command` defined in it, nor take dependencies or outputs under DVC control. In the DVC-file - contents, the file checksums will be empty; They will be populated the next - time this stage is actually executed. This is useful if, for example, you need - to build a pipeline (dependency graph) first, and then run it all at once. + contents, the file hashes will be empty; They will be populated the next time + this stage is actually executed. This is useful if, for example, you need to + build a pipeline (dependency graph) first, and then run it all at once. - `-y`, `--yes` (_deprecated_) - See `--overwrite-dvcfile` below. diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 2e332b20ae..0d72d82a6a 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -39,10 +39,10 @@ limited to specific DVC-files by listing them as `targets`. Changes are reported only against the given `targets`. When combined with the `--with-deps` option, a search is made for changes in other stages that affect the target. -In the `local` mode, changes are detected through the checksum of every file -listed in every DVC-file in question against the corresponding file in the file -system. 
The command output indicates the detected changes, if any. If no -differences are detected, `dvc status` prints this message: +In the `local` mode, changes are detected by comparing the hash value of every +file listed in every (target) DVC-file, against the computed hash of +corresponding files in the workspace. The command output lists detected changes, +if any. If no differences are detected, `dvc status` prints this message: ```dvc $ dvc status @@ -55,11 +55,11 @@ be executed by `dvc repro`. If instead, differences are detected, `dvc status` lists those changes. For each DVC-file (stage) with differences, the changes in dependencies and/or outputs that differ are listed. For each item listed, either -the file name or the checksum is shown, and additionally a status word is shown +the file name or hash is shown, and additionally a status word is shown describing the changes (described below). -- _changed checksum_ means that the DVC-file checksum has changed - (e.g. someone manually edited the file). +- _changed checksum_ means that the DVC-file hash has changed (e.g. + someone manually edited the file). - _always changed_ means that this is a DVC-file with no dependencies (an _orphan_ stage file) or that it has the `always_changed: true` value set (see @@ -71,16 +71,15 @@ describing the changes (described below). commands like `dvc commit` or `dvc repro`, `dvc run` should be run to update the file. Possible states are: - - _new_: Output exists in workspace, but there is no - corresponding checksum calculated and saved in the DVC-file for this output - yet. - - _modified_: Output or dependency exists in workspace, but the corresponding - checksum in the DVC-file is not up to date. - - _deleted_: Output or dependency does not exist in workspace, but still - referred in the DVC-file. - - _not in cache_: Output exists in workspace and the corresponding checksum in - the DVC-file is up to date, but there is no corresponding cache - entry. 
+ - _new_: An output is found in the workspace, but there is no + corresponding file hash saved in a DVC-file yet. + - _modified_: An output or dependency is found in the workspace, + but the corresponding file hash in the DVC-file is not up to date. + - _deleted_: The output or dependency is referenced in a DVC-file, but does + not exist in the workspace. + - _not in cache_: An output exists in workspace and the corresponding file + hash in the DVC-file is up to date, but there is no corresponding + cache entry. **For comparison against remote storage:** diff --git a/public/static/docs/get-started/add-files.md b/public/static/docs/get-started/add-files.md index 54d4bc05bd..7c555c075b 100644 --- a/public/static/docs/get-started/add-files.md +++ b/public/static/docs/get-started/add-files.md @@ -52,9 +52,9 @@ $ ls -R .dvc/cache 04afb96060aad90176268345e10355 ``` -`a304afb96060aad90176268345e10355` above is the file checksum of the `data.xml` -file we just added to DVC. If you check the `data/data.xml.dvc` DVC-file, you -will see that it has this string inside. +`a304afb96060aad90176268345e10355` above is the file hash of the `data.xml` file +we just added to DVC. If you check the `data/data.xml.dvc` DVC-file, you will +see that it has this string inside. ### Important note on cache performance diff --git a/public/static/docs/get-started/store-data.md b/public/static/docs/get-started/store-data.md index 4ce4ac2663..1512150922 100644 --- a/public/static/docs/get-started/store-data.md +++ b/public/static/docs/get-started/store-data.md @@ -35,7 +35,7 @@ $ ls -R /tmp/dvc-storage 04afb96060aad90176268345e10355 ``` -`a304afb96060aad90176268345e10355` above is the file hash of the `data.xml` +`a304afb96060aad90176268345e10355` above is the file hash of the `data.xml` file. If you check the `data.xml.dvc` [DVC-file](/doc/user-guide/dvc-file-format), you will see that it has this string inside.
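The relationship between the hash `a304afb96060aad90176268345e10355` and the `04afb96060aad90176268345e10355` entry in the listings above can be sketched in a few lines. This is a rough illustration rather than DVC's own code: `cache_entry` is a hypothetical helper, and it assumes the layout shown in these docs, where the first two characters of the hash name a directory and the remaining characters name the file.

```python
from pathlib import Path


def cache_entry(file_hash, cache_dir=".dvc/cache"):
    """Map a file hash to <cache_dir>/<first 2 chars>/<remaining chars>.

    Hypothetical helper for illustration; the layout is assumed from the docs.
    """
    return Path(cache_dir) / file_hash[:2] / file_hash[2:]


# Hash value taken from the listings above.
print(cache_entry("a304afb96060aad90176268345e10355").as_posix())
# prints: .dvc/cache/a3/04afb96060aad90176268345e10355
```

Passing `cache_dir="/tmp/dvc-storage"` gives the corresponding path in the local remote used in the `store-data.md` example.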
diff --git a/public/static/docs/tutorials/deep/define-ml-pipeline.md b/public/static/docs/tutorials/deep/define-ml-pipeline.md index 198d0fb5dc..5899270ce6 100644 --- a/public/static/docs/tutorials/deep/define-ml-pipeline.md +++ b/public/static/docs/tutorials/deep/define-ml-pipeline.md @@ -69,7 +69,7 @@ need to run `dvc unprotect` or `dvc remove` first (see the If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by `dvc add`, you will see that outputs are tracked in the `outs` field. In this file, only one output is specified. The output contains the data -file path in the repository and its MD5 checksum. This checksum determines the +file path in the repository and its MD5 hash. This hash value determines the location of the actual content file in the [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), `.dvc/cache`. @@ -224,8 +224,8 @@ outs: Sections of the file above include: - `cmd`: The command to run -- `deps`: Dependencies with MD5 checksums -- `outs`: Outputs with MD5 checksums +- `deps`: Dependencies with MD5 hashes +- `outs`: Outputs with MD5 hashes And (as with the `dvc add` command) the `data/.gitignore` file was modified. Now it includes the unarchived command output file `Posts.xml`. @@ -242,7 +242,7 @@ Posts.xml The output file `Posts.xml` was transformed by DVC into a data file in accordance with the `-o` option. You can find the corresponding cache file with -the checksum, with a path starting in `c1/fa36d` as we can see below: +the hash value, as a path starting in `c1/fa36d`: ```dvc $ ls .dvc/cache/ diff --git a/public/static/docs/tutorials/deep/reproducibility.md b/public/static/docs/tutorials/deep/reproducibility.md index f8c19c8909..5f05fa078b 100644 --- a/public/static/docs/tutorials/deep/reproducibility.md +++ b/public/static/docs/tutorials/deep/reproducibility.md @@ -116,7 +116,7 @@ master: Let's keep the result in the repository. 
Later we can find out why bigrams don't add value to the current model and change that. -Many DVC-files were changed. This happened due to file checksum changes. +Many DVC-files were changed. This happened due to file hash changes. ```dvc $ git status -s @@ -233,9 +233,9 @@ CONFLICT (content): Merge conflict in Dvcfile Automatic merge failed; fix conflicts and then commit the result. ``` -The merge has a few conflicts. All of the conflicts are related to file checksum +The merge has a few conflicts. All of the conflicts are related to file hash mismatches in the branches. You can properly merge conflicts by prioritizing the -checksums from the bigrams branch: that is, by removing all checksums of the +file hashes from the bigrams branch: that is, by removing all hashes of the other branch. [Here](https://help.github.com/en/articles/resolving-a-merge-conflict-using-the-command-line) you can find a tutorial that clarifies how to do that. It is also important to @@ -245,15 +245,15 @@ remove all automatically generated =======, >>>>>>>) from `model.p.dvc` and `Dvcfile`. -Another way to solve git merge conflicts is to simply replace all checksums with -empty strings ''. The only disadvantage of this trick is that DVC will need to -recompute the outputs checksums. +Another way to solve git merge conflicts is to simply replace all file hashes +with empty strings ''. The only disadvantage of this trick is that DVC will need +to recompute the output hashes. 
After resolving the conflicts you need to checkout a proper version of the data files: ```dvc -# Replace conflicting checksums to empty string '' +# Replace conflicting hashes with empty string '' $ vi model.p.dvc $ vi Dvcfile $ dvc checkout diff --git a/public/static/docs/tutorials/pipelines.md b/public/static/docs/tutorials/pipelines.md index 89957e791b..603a417839 100644 --- a/public/static/docs/tutorials/pipelines.md +++ b/public/static/docs/tutorials/pipelines.md @@ -183,8 +183,8 @@ outs: ``` Just like the DVC-file we created earlier with `dvc add`, this stage file uses -checksums that point to the cache, to describe and version control dependencies -and outputs. Output `data/Posts.xml` file is saved as +`md5` hashes (that point to the cache) to describe and version control +dependencies and outputs. Output `data/Posts.xml` file is saved as `.dvc/cache/a3/04afb96060aad90176268345e10355` and linked (or copied) to the workspace, as well as added to `.gitignore`. @@ -194,7 +194,7 @@ stages) we need to apply. This is important when you run `dvc repro` to regenerate the final or intermediate result. Second, hopefully it's clear by now that the actual data is stored in the -`.dvc/cache` directory, each file having a name based on an `md5` checksum. This +`.dvc/cache` directory, each file having a name based on an `md5` hash. This cache is similar to Git's [objects database](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects), but made specifically to handle large data files. diff --git a/public/static/docs/understanding-dvc/related-technologies.md b/public/static/docs/understanding-dvc/related-technologies.md index c194981fdb..2bf2dcb0b3 100644 --- a/public/static/docs/understanding-dvc/related-technologies.md +++ b/public/static/docs/understanding-dvc/related-technologies.md @@ -77,13 +77,13 @@ http://studio.ml/ - File tracking: - - DVC tracks files based on their checksum (MD5) instead of file timestamps. 
+ - DVC tracks files based on their hashes (MD5) instead of file timestamps. This helps avoid running into heavy processes like model retraining when you checkout a previous, trained version of a model's code (Make would retrain the model). - DVC uses file timestamps and inodes for optimization. This allows DVC to - avoid recomputing all dependency files' checksums, which would be highly + avoid recomputing all dependency file hashes, which would be highly problematic when working with large files (10 GB+). ### Git-annex @@ -95,7 +95,7 @@ http://studio.ml/ - DVC can use reflinks\* or hardlinks (depending on the system) instead of symlinks to improve performance and the user experience. -- DVC optimizes checksum calculation. +- DVC optimizes file hash calculation. - Git-annex is a datafile-centric system whereas DVC is focused on providing a workflow for machine learning and reproducible experiments. When a DVC or diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index 3799677253..5763545e15 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -49,7 +49,7 @@ On the top level, `.dvc` file consists of these fields: - `cmd`: Executable command defined in this stage - `deps`: List of dependencies for this stage - `outs`: List of outputs for this stage -- `md5`: md5 checksum for this DVC-file +- `md5`: MD5 hash for this DVC-file - `locked`: Whether or not this stage is locked from reproduction - `wdir`: Directory to run command in (default `.`) - `always_changed`: Whether or not this stage should always be considered as @@ -58,8 +58,7 @@ On the top level, `.dvc` file consists of these fields: A dependency entry consists of a pair of fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5`: md5 checksum for the dependency (most - [stages](/doc/command-reference/run)) +- `md5`: MD5 hash for the dependency 
(most [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) - `repo`: This entry is only for external dependencies created with @@ -80,7 +79,7 @@ A dependency entry consists of a pair of fields: An output entry consists of these fields: - `path`: Path to the output, relative to the `wdir` path -- `md5`: md5 checksum for the output +- `md5`: MD5 hash for the output - `cache`: Whether or not dvc should cache the output - `metric`: Whether or not this file is a [metric](/doc/command-reference/metrics) file diff --git a/public/static/docs/user-guide/dvc-files-and-directories.md b/public/static/docs/user-guide/dvc-files-and-directories.md index e971e23df8..d317a92ec4 100644 --- a/public/static/docs/user-guide/dvc-files-and-directories.md +++ b/public/static/docs/user-guide/dvc-files-and-directories.md @@ -25,8 +25,8 @@ operation: > needed to download or reproduce them. - `.dvc/state`: This file is used for optimization. It is a SQLite database, - that contains checksums for files tracked in a DVC project, with respective - timestamps and inodes to avoid unnecessary checksum computations. It also + that contains hash values for files tracked in a DVC project, with respective + timestamps and inodes to avoid unnecessary file hash computations. It also contains a list of links (from cache to workspace) created by DVC and is used to cleanup your workspace when calling `dvc checkout`. @@ -52,17 +52,17 @@ operation: There are two ways in which the data is stored in cache: As a single file (eg. `data.csv`), or a directory of files. -For the first case, we calculate the file's checksum, a 32 characters long -string (usually MD5). The first two characters are used to name the directory -inside `.dvc/cache`, and the rest become the file name of the cached file. 
For -example, if a data file `Posts.xml.zip` has checksum -`ec1d2935f811b77cc49b031b999cbf17`, its cache entry will be -`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17` locally. +For the first case, we calculate the file hash, a 32 characters long string +(usually MD5). The first two characters are used to name the directory inside +`.dvc/cache`, and the rest become the file name of the cached file. For example, +if a data file `Posts.xml.zip` has a hash value of +`ec1d2935f811b77cc49b031b999cbf17`, its local cache entry will be +`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`. -> Note that file checksums are calculated from file contents only. 2 or more -> files with different names but the same contents can exist in the workspace -> and be tracked by DVC, but only one copy is stored in the cache. This helps -> avoid data duplication in cache and remotes. +> Note that file hashes are calculated from file contents only. 2 or more files +> with different names but the same contents can exist in the workspace and be +> tracked by DVC, but only one copy is stored in the cache. This helps avoid +> data duplication in cache and remotes. For the second case, let us consider a directory with 2 images. @@ -77,8 +77,8 @@ $ dvc add data/images ``` When running `dvc add` on this directory of images, a `data/images.dvc` -[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the checksum -of the directory: +[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the hash +value of the directory: ```yaml md5: 77e511dafe2178d936e54331d5d6288f @@ -104,7 +104,7 @@ $ tree .dvc/cache The cache file with `.dir` extension is a special text file that contains the mapping of files in the `data/` directory (as a JSON array), along with their -checksums. The other two cache files are the files inside `data/`. A typical +hash values. The other two cache files are the files inside `data/`. 
A typical `.dir` cache file looks like this: ```dvc diff --git a/public/static/docs/user-guide/managing-external-data.md b/public/static/docs/user-guide/managing-external-data.md index 7fba5a10ed..55dddc5a10 100644 --- a/public/static/docs/user-guide/managing-external-data.md +++ b/public/static/docs/user-guide/managing-external-data.md @@ -35,9 +35,8 @@ Non-cached external outputs (`-O`) do not require an external cache to be setup. > Avoid using the same remote location that you are using for `dvc push`, > `dvc pull`, `dvc fetch` as external cache for your external outputs, because -> it may cause possible checksum overlaps. Checksum for some data file on an -> external storage can potentially collide with checksum generated locally for a -> different file, with a different content. +> it may cause possible file hash overlaps: The hash value of a data file in +> external storage could collide with that generated locally for another file. ## Examples From 67bc455509272bfe8ff3d90eb062ceb05f885239 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 13 Feb 2020 16:46:41 -0600 Subject: [PATCH 27/27] term: SHA hash -> hash (Git commit context) per https://github.com/iterative/dvc.org/pull/962#discussion_r378059334 --- public/static/docs/command-reference/diff.md | 2 +- public/static/docs/command-reference/get.md | 2 +- public/static/docs/command-reference/import.md | 9 ++++----- public/static/docs/command-reference/install.md | 10 +++++----- public/static/docs/command-reference/metrics/diff.md | 2 +- public/static/docs/command-reference/update.md | 2 +- public/static/docs/command-reference/version.md | 10 +++++----- public/static/docs/install/pre-release.md | 6 +++--- public/static/docs/understanding-dvc/what-is-dvc.md | 2 +- .../docs/use-cases/versioning-data-and-model-files.md | 2 +- public/static/docs/user-guide/dvc-file-format.md | 6 +++--- 11 files changed, 26 insertions(+), 27 deletions(-) diff --git a/public/static/docs/command-reference/diff.md 
b/public/static/docs/command-reference/diff.md index 72898b486c..40ba5894f8 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -17,7 +17,7 @@ positional arguments: ## Description -Given two commit SHA hashes, branch or tag names, etc. +Given two commit hashes, branch or tag names, etc. ([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this command shows a comparative summary of basic statistics related to files tracked by DVC: how many files were deleted/changed, and the file size differences. diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index dc7f196076..28e36e3754 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -56,7 +56,7 @@ name. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - commit SHA hash, branch or tag name, etc. (any +- `--rev` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to download the file or directory from. The latest commit in `master` (tip of the default branch) is used by default when this option is not specified. diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index ac9cb0c17d..92f323e5a6 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -73,7 +73,7 @@ data artifact from the source repo. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - commit SHA hash, branch or tag name, etc. (any +- `--rev` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to download the file or directory from. The latest commit in `master` (tip of the default branch) is used by default when this option is not specified. 
@@ -159,10 +159,9 @@ deps: If `rev` is a Git branch or tag (where the commit it points to changes), the data source may have updates at a later time. To bring it up to date if so (and update `rev_lock` in the DVC-file), simply use `dvc update .dvc`. If -`rev` is a specific commit SHA hash (does not change), `dvc update` will never -have an effect on the import stage. You may **re-import** a different commit -instead, by using `dvc import` again with a different (or without) `--rev`. For -example: +`rev` is a specific commit hash (does not change), `dvc update` will never have +an effect on the import stage. You may **re-import** a different commit instead, +by using `dvc import` again with a different (or without) `--rev`. For example: ```dvc $ dvc import --rev master \ diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index 3b24256c11..6eb8ab76a4 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -21,11 +21,11 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present). Namely: -**Checkout**: For any commit SHA hash, branch or tag, `git checkout` retrieves -the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. -The project's DVC-files in turn refer to data stored in cache, but -not necessarily in the workspace. Normally, it would be necessary -to run `dvc checkout` to synchronize workspace and DVC-files. +**Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the +[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The +project's DVC-files in turn refer to data stored in cache, but not +necessarily in the workspace. Normally, it would be necessary to +run `dvc checkout` to synchronize workspace and DVC-files. This hook automates running `dvc checkout`. 
diff --git a/public/static/docs/command-reference/metrics/diff.md b/public/static/docs/command-reference/metrics/diff.md index 2163f2a184..23a2fc3052 100644 --- a/public/static/docs/command-reference/metrics/diff.md +++ b/public/static/docs/command-reference/metrics/diff.md @@ -28,7 +28,7 @@ this command compares all existing metric files currently present in the The differences shown by this command include the new value, and numeric difference (delta) from the previous value of metrics (with 3-digit accuracy). -They're calculated between two commits (SHA hash, branch, tag, or any +They're calculated between two commits (hash, branch, tag, or any [Git revision](https://git-scm.com/docs/revisions)) for all metrics in the project, found by examining all of the [DVC-files](/doc/user-guide/dvc-file-format) in both references. diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index c4c81f1558..630c342867 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -27,7 +27,7 @@ Note that import stages are considered always locked, meaning that if you run update them. `dvc update` will not have an effect on import stages that are fixed to a commit -SHA hash (`rev` field in the DVC-file). Please refer to +hash (`rev` field in the DVC-file). Please refer to [Fixed revisions & re-importing](/doc/command-reference/import#example-fixed-revisions-re-importing) for more details. 
diff --git a/public/static/docs/command-reference/version.md b/public/static/docs/command-reference/version.md index 4e54bc68bb..0e4d94c299 100644 --- a/public/static/docs/command-reference/version.md +++ b/public/static/docs/command-reference/version.md @@ -16,7 +16,7 @@ system/environment: | Line | Detail | | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit SHA hash in case of a development version) | +| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | | `Python version` | Version of Python used in the environment where DVC is initialized | | `Platform` | Information about the operating system of the machine | | [`Binary`](#what-we-mean-by-binary) | Shows whether DVC was installed from a package or from a binary release | @@ -53,10 +53,10 @@ The detail of DVC version depends upon the way of installing DVC. that might not be ready to publish yet. Therefore installing using the above command might have issues regarding its usage. So to trace any error reported with this setup, we need to know exactly which version is being used. For this - we rely on a Git commit SHA hash, that is displayed in this command's output - like this: `0.40.2+292cab.mod`. The part before `+` is the `_BASE_VERSION`, - and the following part is the SHA of the tip of the `master` branch. The - optional suffix `.mod` means that code is modified. + we rely on a Git commit hash, that is displayed in this command's output like + this: `0.40.2+292cab.mod`. The part before `+` is the `_BASE_VERSION`, and the + following part is the SHA of the tip of the `master` branch. The optional + suffix `.mod` means that code is modified. 
### What we mean by "Binary" diff --git a/public/static/docs/install/pre-release.md b/public/static/docs/install/pre-release.md index 887011b0c5..1f31b1353a 100644 --- a/public/static/docs/install/pre-release.md +++ b/public/static/docs/install/pre-release.md @@ -15,9 +15,9 @@ $ pip install git+https://github.com/iterative/dvc ``` > `gitpython` allows the installation process to generate a DVC version using -> the current Git commit SHA hash. This lets us to distinguish official DVC -> releases (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). -> For more information on our versioning convention, refer to +> the current Git commit hash. This lets us distinguish official DVC releases +> (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). For more +> information on our versioning convention, refer to > [Components of DVC version](/doc/command-reference/version#components-of-dvc-version). To install a development version for contributing to the project, please refer diff --git a/public/static/docs/understanding-dvc/what-is-dvc.md b/public/static/docs/understanding-dvc/what-is-dvc.md index 6358ba5ff7..444d7a6774 100644 --- a/public/static/docs/understanding-dvc/what-is-dvc.md +++ b/public/static/docs/understanding-dvc/what-is-dvc.md @@ -26,7 +26,7 @@ DVC uses a few core concepts: recompute the results after a successful merge. - **Experiment state** or state: Equivalent to a Git snapshot (all committed - files). A Git commit SHA hash, branch or tag name, etc. can be used as a + files). A Git commit hash, branch or tag name, etc. can be used as a [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an experiment state.
diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index 5b94f215ee..68e76f68f4 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -86,7 +86,7 @@ file. Let's consider the full checkout first. It's quite straightforward: > `v1.0` below is a Git tag that should be created in advance to identify the > dataset version you are interested in. Any > [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -> (for example `HEAD^` or a commit SHA hash) can be used instead. +> (for example `HEAD^` or a commit hash) can be used instead. ```dvc $ git checkout v1.0 diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index 5763545e15..93b7d661ec 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -66,11 +66,11 @@ A dependency entry consists of a pair of fields: - `url`: URL of Git repository with source DVC project - `rev`: Only present when the `--rev` option of `dvc import` is used. - Specific commit SHA hash, branch or tag name, etc. (a + Specific commit hash, branch or tag name, etc. (a [Git revision](https://git-scm.com/docs/revisions)) used to import the dependency from. - - `rev_lock`: Git commit SHA hash of the external DVC repository - at the time of importing or updating (with `dvc update`) the dependency. + - `rev_lock`: Git commit hash of the external DVC repository at + the time of importing or updating (with `dvc update`) the dependency. > See the examples in > [External Dependencies](/doc/user-guide/external-dependencies) for more