diff --git a/public/static/docs/changelog/0.35.md b/public/static/docs/changelog/0.35.md index 20d7b69e80..f229851be2 100644 --- a/public/static/docs/changelog/0.35.md +++ b/public/static/docs/changelog/0.35.md @@ -59,9 +59,9 @@ improvements) we have done in the last few months: - ⚡️ **Performance optimizations.** The most notable one is the migration from using a plain JSON file to an (embedded) SQLLite instance, to cache file and - directory checksums. Another one is improved performance, stability and - general user experience for the commands that navigate tags or branches (all - the commands that include `--all-bracnhes`, `-a` or `--all-tags`, `-T`). + directory hashes. Another one is improved performance, stability and general + user experience for the commands that navigate tags or branches (all the + commands that include `--all-bracnhes`, `-a` or `--all-tags`, `-T`). There are new [integrations and plugins](/doc/install/plugins) available: diff --git a/public/static/docs/command-reference/add.md b/public/static/docs/command-reference/add.md index cb3c3b8b0d..e74af2c70a 100644 --- a/public/static/docs/command-reference/add.md +++ b/public/static/docs/command-reference/add.md @@ -29,16 +29,16 @@ that becomes [external outputs](/doc/user-guide/managing-external-data). Under the hood, a few actions are taken for each file (or directory) in `targets`: -1. Calculate the file checksum. +1. Calculate the file hashes. 2. Move the file contents to the cache directory (by default in `.dvc/cache`), - using the checksum to form the cached file names. (See + using the file hash to form the cached file names. (See [Structure of cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory) for more details.) 3. Attempt to replace the file by a link to the file in cache (more details below). -4. Create a corresponding DVC-file and store the checksum to identify the cached - file. 
Unless the `-f` option is used, the DVC-file name generated by default - is `.dvc`, where `` is the file name of the first target. +4. Create a corresponding DVC-file and store the file hash to identify the + cached file. Unless the `-f` option is used, the DVC-file name generated by + default is `.dvc`, where `` is the file name of the first target. 5. Unless `dvc init --no-scm` was used when initializing the project, add the `targets` to `.gitignore` in order to prevent them from being committed to the Git repository. @@ -48,7 +48,7 @@ Under the hood, a few actions are taken for each file (or directory) in The result is that the target data gets cached by DVC, and instead small DVC-files can be tracked with Git. The DVC-file lists the added file as an -output (`outs` field), and references the cached file using the checksum. See +output (`outs` field), and references the cached file using its hash. See [DVC-File Format](/doc/user-guide/dvc-file-format) for more details. > Note that DVC-files created by this command are considered _orphans_ because @@ -150,7 +150,7 @@ meta: # Special field to contain arbitary user data email: john@xyz.com ``` -This is a standard DVC-file with only an `outs` entry. The checksum should +This is a standard DVC-file with only an `outs` entry. The hash value should correspond to an entry in the cache. > Note that the `meta` values above were entered manually for this example. Meta diff --git a/public/static/docs/command-reference/checkout.md b/public/static/docs/command-reference/checkout.md index e5082c87f2..2947f2998c 100644 --- a/public/static/docs/command-reference/checkout.md +++ b/public/static/docs/command-reference/checkout.md @@ -33,8 +33,8 @@ The execution of `dvc checkout` does the following: - Scans the DVC-files to compare against the data files or directories in the workspace. DVC knows which data (outputs) match - because the corresponding file hash values are saved in the `outs` fields in - the DVC-files. 
Scanning is limited to the given `targets` (if any). See also + because the corresponding hash values are saved in the `outs` fields in the + DVC-files. Scanning is limited to the given `targets` (if any). See also options `--with-deps` and `--recursive` below. - Missing data files or directories, or those that don't match with any @@ -147,7 +147,7 @@ bigrams-experiment <- Uses bigrams to improve the model This project comes with a predefined HTTP [remote storage](/doc/command-reference/remote). We can now just run `dvc pull` that will fetch and checkout the most recent `model.pkl`, `data.xml`, and other -files that are under DVC control. The model file checksum +files that are under DVC control. The model file hash `3863d0e317dee0a55c4e59d2ec0eef33` will be used in the `train.dvc` [stage file](/doc/command-reference/run): @@ -195,10 +195,10 @@ MD5 (model.pkl) = 43630cce66a2432dcecddc9dd006d0a7 ``` What happened is that DVC went through the DVC-files and adjusted the current -set of files to match the `outs` in them. `dvc fetch` is run this once to -download missing data from the remote storage to the cache. -(Alternatively, we could have just run `dvc pull` to do `dvc fetch` + -`dvc checkout` in one step.) +set of output files to match the `outs` in them. `dvc fetch` is run +this once to download missing data from the remote storage to the +cache. (Alternatively, we could have just run `dvc pull` to do +`dvc fetch` + `dvc checkout` in one step.) ## Example: Automating DVC checkout diff --git a/public/static/docs/command-reference/commit.md b/public/static/docs/command-reference/commit.md index 868dae01a4..0f2c0861bb 100644 --- a/public/static/docs/command-reference/commit.md +++ b/public/static/docs/command-reference/commit.md @@ -49,8 +49,8 @@ Let's take a look at what is happening in the fist scenario closely. Normally DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the cache after creating a DVC-file. 
What _commit_ means is that DVC: -- Computes a checksum for the file/directory. -- Enters the checksum and file name into the DVC-file. +- Computes a hash for the file/directory. +- Enters the hash value and file name into the DVC-file. - Tells Git to ignore the file/directory (adding an entry to `.gitignore`). (Note that if the project was initialized with no SCM support (`dvc init --no-scm`), this does not happen.) @@ -59,10 +59,10 @@ DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the There are many cases where the last step is not desirable (for example rapid iterations on an experiment). The `--no-commit` option prevents the last step from occurring (on the commands where it's available), saving time and space by -not storing unwanted data artifacts. Checksums is still computed -and added to the DVC-file, but the actual data file is not saved in the cache. -This is where the `dvc commit` command comes into play. It performs that last -step (saving the data in cache). +not storing unwanted data artifacts. The file hash is still +computed and added to the DVC-file, but the actual data file is not saved in the +cache. This is where the `dvc commit` command comes into play. It performs that +last step (saving the data in cache). Note that it's best to avoid the last two scenarios. They essentially force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to @@ -81,7 +81,7 @@ reproducibility in those cases. for this option to have effect. Determines the files to commit by searching each target directory and its subdirectories for DVC-files to inspect. -- `-f`, `--force` - commit data even if checksums for dependencies or outputs +- `-f`, `--force` - commit data even if hash values for dependencies or outputs did not change. - `-h`, `--help` - prints the usage/help message, and exit. @@ -196,7 +196,7 @@ wdir: . To verify this instance of `model.pkl` is not in the cache, we must know the path to the cached file. 
In the cache directory, the first two characters of the -checksum are used as a subdirectory name, and the remaining characters are the +hash value are used as a subdirectory name, and the remaining characters are the file name. Therefore, had the file been committed to the cache, it would appear in the directory `.dvc/cache/70`. Let's check: diff --git a/public/static/docs/command-reference/config.md b/public/static/docs/command-reference/config.md index 554529a754..11a41307b2 100644 --- a/public/static/docs/command-reference/config.md +++ b/public/static/docs/command-reference/config.md @@ -76,7 +76,7 @@ This is the main section with the general config options: [anonymized usage statistics](/doc/user-guide/analytics). Accepts values `true` (default) and `false`. -- `core.checksum_jobs` - number of threads for computing checksums. Accepts +- `core.checksum_jobs` - number of threads for computing file hashes. Accepts positive integers. The default value is `max(1, min(4, cpu_count() // 2))`. - `core.hardlink_lock` - use hardlink file locks instead of the default ones, @@ -168,9 +168,9 @@ for more details.) This section contains the following options: > Avoid using the same remote location that you are using for `dvc push`, > `dvc pull`, `dvc fetch` as external cache for your external outputs, because - > it may cause possible checksum overlaps. Checksum for some data file on an - > external storage can potentially collide with checksum generated locally for - > a different file, with a different content. + > it may cause possible file hash overlaps: the hash of a data file in + > external storage could collide with a hash generated locally for another + > file with a different content. - `cache.s3` - name of an [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3). @@ -191,9 +191,9 @@ learn more about the state file (database) that is used for optimization. 
- `state.row_limit` - maximum number of entries in the state database, which affects the physical size of the state file itself, as well as the performance - of certain DVC operations. The bigger the limit the more checksum history DVC - can keep in order to avoid sequential checksum recalculations for the files. - Default limit is set to 10 000 000 rows. + of certain DVC operations. The bigger the limit, the longer the file hash + history that DVC can keep, in order to avoid sequential hash recalculations. + The default limit is set to 10,000,000 rows. - `state.row_cleanup_quota` - percentage of the state database that is going to be deleted when it hits the `state.row_limit`. When an entry in the database diff --git a/public/static/docs/command-reference/diff.md b/public/static/docs/command-reference/diff.md index c4d750dde3..40ba5894f8 100644 --- a/public/static/docs/command-reference/diff.md +++ b/public/static/docs/command-reference/diff.md @@ -17,10 +17,10 @@ positional arguments: ## Description -Given two commit SHA hashes, branch or tag names, etc. +Given two commit hashes, branch or tag names, etc. ([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this -command shows a comparative summary of basic statistics: how many files were -deleted/changed, and the file size differences. +command shows a comparative summary of basic statistics related to files tracked +by DVC: how many files were deleted/changed, and the file size differences. > Note that `dvc diff` does not show the line-to-line comparisons like > `git diff` or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This @@ -78,12 +78,12 @@ Preparing to download data from 'https://remote.dvc.org/get-started' ## Example: Previous commit in the same branch -The minimal `dvc diff` command only includes the "from" reference (`a_ref`) from -which to calculate the difference. The "until" reference (`b_ref`) defaults to -`HEAD` (current Git commit). 
+The minimal `dvc diff`, run without arguments, defaults to comparing DVC-tracked +files between `HEAD` (current Git commit) and the current workspace +(uncommitted changes, if any). -To see the difference with the very previous commit of the project, we can use -`HEAD^` as `a_ref`: +To see the difference between the very previous commit of the project and the +workspace, we can use `HEAD^` as `a_ref`: ```dvc $ dvc diff HEAD^ diff --git a/public/static/docs/command-reference/fetch.md b/public/static/docs/command-reference/fetch.md index 6c157c9205..f27ce175ea 100644 --- a/public/static/docs/command-reference/fetch.md +++ b/public/static/docs/command-reference/fetch.md @@ -64,11 +64,10 @@ for more information on how to configure different remote storage providers. `dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands perform data synchronization among local and remote storage. The specific way in which the set of files to push/fetch/pull is determined begins with calculating -the checksums of the files in question, when these are -[added](/doc/get-started/add-files) to DVC. File checksums are then stored in -the corresponding DVC-files (usually saved in a Git branch). Only the checksums -specified in DVC-files currently in the project are considered by `dvc fetch` -(unless the `-a` or `-T` options are used). +file hashes when these are [added](/doc/get-started/add-files) to DVC. File +hashes are stored in the corresponding DVC-files (typically versioned with Git). +Only the hashes specified in DVC-files currently in the workspace are considered +by `dvc fetch` (unless the `-a` or `-T` options are used). ## Options @@ -103,7 +102,7 @@ specified in DVC-files currently in the project are considered by `dvc fetch` - `-T`, `--all-tags` - fetch cache for all Git tags. Similar to `-a` above. Note that both options can be combined, for example using the `-aT` flag. 
-- `--show-checksums` - show checksums instead of file names when printing the +- `--show-checksums` - show file hashes instead of file names when printing the download progress. * `-h`, `--help` - prints the usage/help message, and exit. @@ -194,8 +193,8 @@ Note that the `.dvc/cache` directory was created and populated. > for more info. As seen above, used without arguments, `dvc fetch` downloads all assets needed -by all DVC-files in the current branch, including for directories. The checksums -`3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` +by all DVC-files in the current branch, including for directories. The hash +values `3863d0e317dee0a55c4e59d2ec0eef33` and `42c7025fc0edeb174069280d17add2d4` correspond to the `model.pkl` file and `data/features/` directory, respectively. Let's now link files from the cache to the workspace with: @@ -232,8 +231,8 @@ $ tree .dvc/cache > Note that `prepare.dvc` is the first stage in our example's pipeline. Cache entries for the necessary directories, as well as the actual -`data/prepared/test.tsv` and `data/prepared/train.tsv` files were download, -checksums shown above. +`data/prepared/test.tsv` and `data/prepared/train.tsv` files were downloaded. +Their hash values are shown above. 
## Example: With dependencies diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index 0509f80dcf..20ef13e36e 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -79,7 +79,7 @@ $ du -sh .dvc/cache/ ``` When you run `dvc gc` it removes all objects from cache that are not referenced -in the workspace (by collecting hash sums from the DVC-files): +in the workspace (by collecting hash values from the DVC-files): ```dvc $ dvc gc diff --git a/public/static/docs/command-reference/get.md b/public/static/docs/command-reference/get.md index 7844c3eff7..28e36e3754 100644 --- a/public/static/docs/command-reference/get.md +++ b/public/static/docs/command-reference/get.md @@ -56,7 +56,7 @@ name. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - commit SHA hash, branch or tag name, etc. (any +- `--rev` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to download the file or directory from. The latest commit in `master` (tip of the default branch) is used by default when this option is not specified. @@ -134,7 +134,7 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951 `remote.dvc.org/get-started` is an HTTP [DVC remote](/doc/command-reference/remote), whereas -`662eb7f64216d9c2c1088d0a5e2c6951` is the file's checksum. +`662eb7f64216d9c2c1088d0a5e2c6951` is the file hash. ## Example: Compare different versions of data or model diff --git a/public/static/docs/command-reference/import-url.md b/public/static/docs/command-reference/import-url.md index d59b65e86a..d497e59282 100644 --- a/public/static/docs/command-reference/import-url.md +++ b/public/static/docs/command-reference/import-url.md @@ -241,7 +241,7 @@ outs: The DVC-file is nearly the same as in the previous example. 
The difference is that the dependency (`deps`) now references the local file in the data store directory we created previously. (Its `path` has the URL for the data store.) -And instead of an `etag` we have an `md5` checksum. We did this so its easy to +And instead of an `etag` we have an `md5` hash value. We did this so it's easy to edit the data file. Let's now manually reproduce a @@ -306,8 +306,8 @@ Data and pipelines are up to date. In the data store directory, edit `data.xml`. It doesn't matter what you change, as long as it remains a valid XML file, because any change will result in a -different dependency file checksum (`md5`) in the import stage DVC-file. Once we -do so, we can run `dvc update` to make sure the import stage is up to date: +different dependency file hash (`md5`) in the import stage DVC-file. Once we do +so, we can run `dvc update` to make sure the import stage is up to date: ```dvc $ dvc update data.xml.dvc diff --git a/public/static/docs/command-reference/import.md b/public/static/docs/command-reference/import.md index ac9cb0c17d..92f323e5a6 100644 --- a/public/static/docs/command-reference/import.md +++ b/public/static/docs/command-reference/import.md @@ -73,7 +73,7 @@ data artifact from the source repo. an existing directory is specified, then the output will be placed inside of it. -- `--rev` - commit SHA hash, branch or tag name, etc. (any +- `--rev` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to download the file or directory from. The latest commit in `master` (tip of the default branch) is used by default when this option is not specified. @@ -159,10 +159,9 @@ deps: If `rev` is a Git branch or tag (where the commit it points to changes), the data source may have updates at a later time. To bring it up to date if so (and update `rev_lock` in the DVC-file), simply use `dvc update .dvc`. 
If -`rev` is a specific commit SHA hash (does not change), `dvc update` will never -have an effect on the import stage. You may **re-import** a different commit -instead, by using `dvc import` again with a different (or without) `--rev`. For -example: +`rev` is a specific commit hash (does not change), `dvc update` will never have +an effect on the import stage. You may **re-import** a different commit instead, +by using `dvc import` again with a different (or without) `--rev`. For example: ```dvc $ dvc import --rev master \ diff --git a/public/static/docs/command-reference/install.md b/public/static/docs/command-reference/install.md index 2e258e557e..6eb8ab76a4 100644 --- a/public/static/docs/command-reference/install.md +++ b/public/static/docs/command-reference/install.md @@ -21,11 +21,11 @@ etc.) doesn't have DVC initialized (no `.dvc/` directory present). Namely: -**Checkout**: For any commit SHA hash, branch or tag, `git checkout` retrieves -the [DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. -The project's DVC-files in turn refer to data stored in cache, but -not necessarily in the workspace. Normally, it would be necessary -to run `dvc checkout` to synchronize workspace and DVC-files. +**Checkout**: For any commit hash, branch or tag, `git checkout` retrieves the +[DVC-files](/doc/user-guide/dvc-file-format) corresponding to that version. The +project's DVC-files in turn refer to data stored in cache, but not +necessarily in the workspace. Normally, it would be necessary to +run `dvc checkout` to synchronize workspace and DVC-files. This hook automates running `dvc checkout`. @@ -174,10 +174,11 @@ running `git checkout master`. We also see that the first `dvc status` tells us about differences between the project's cache and the data files currently in the workspace. Git changed the DVC-files in the workspace, which changed references to data files. 
-What `dvc status` did is inform us the data files in the workspace no longer -matched the checksums in the [DVC-files](/doc/user-guide/dvc-file-format). -Running `dvc checkout` then checks out the corresponding data files, and a -second `dvc status` now tells us the data files match the DVC-files. +`dvc status` first informed us that the data files in the workspace no longer +matched the hash values in the corresponding +[DVC-files](/doc/user-guide/dvc-file-format). Running `dvc checkout` then brings +them up to date, and a second `dvc status` tells us that the data files now do +match the DVC-files. ```dvc $ git checkout master diff --git a/public/static/docs/command-reference/remote/add.md b/public/static/docs/command-reference/remote/add.md index f5343e9dee..c0249e6893 100644 --- a/public/static/docs/command-reference/remote/add.md +++ b/public/static/docs/command-reference/remote/add.md @@ -213,9 +213,8 @@ $ dvc remote modify myremote gdrive_client_secret Note that GDrive remotes are not "trusted" by default. This means that the [`verify`](/doc/command-reference/remote/modify#available-settings-for-all-remotes) -option is enabled on this type of storage, so DVC recalculates the checksums of -files upon download (e.g. `dvc pull`), to make sure that these haven't been -modified. +option is enabled on this type of storage, so DVC recalculates the file hashes +upon download (e.g. `dvc pull`), to make sure that these haven't been modified. > Please note our [Privacy Policy (Google APIs)](/doc/user-guide/privacy). diff --git a/public/static/docs/command-reference/remote/modify.md b/public/static/docs/command-reference/remote/modify.md index 6cfad99443..40b884b8c4 100644 --- a/public/static/docs/command-reference/remote/modify.md +++ b/public/static/docs/command-reference/remote/modify.md @@ -64,11 +64,10 @@ manual editing could be used to change the configuration. 
The following options are available for all remote types: - `verify` - upon downloading cache files (`dvc pull`, `dvc fetch`) - DVC will recalculate the checksums of files upon download (e.g. `dvc pull`) to - make sure that these haven't been modified, or corrupted during download. It - may slow down the aforementioned commands. The calculated checksum is compared - to the one saved in the corresponding - [DVC-file](/doc/user-guide/dvc-file-format). + DVC will recalculate the file hashes upon download (e.g. `dvc pull`) to make + sure that these haven't been modified, or corrupted during download. It may + slow down the aforementioned commands. The calculated hash is compared to the + value saved in the corresponding [DVC-file](/doc/user-guide/dvc-file-format). > Note that this option is enabled on **Google Drive** remotes by default. diff --git a/public/static/docs/command-reference/repro.md b/public/static/docs/command-reference/repro.md index a25078d862..4f26a4dad0 100644 --- a/public/static/docs/command-reference/repro.md +++ b/public/static/docs/command-reference/repro.md @@ -244,7 +244,7 @@ Saving information to 'Dvcfile'. ``` You can now check that `Dvcfile` and `count.txt` have been updated with the new -information and updated dependency/output file checksums, and a new result, +information and updated dependency/output file hash values, and a new result, respectively. ## Example: Downstream diff --git a/public/static/docs/command-reference/status.md b/public/static/docs/command-reference/status.md index 0b43fe9554..3b990354da 100644 --- a/public/static/docs/command-reference/status.md +++ b/public/static/docs/command-reference/status.md @@ -40,7 +40,7 @@ The comparison can be limited to certain DVC-files only, by listing them as the `--with-deps` option, a search is made for changes in other stages that affect each target. 
-In the `local` mode, changes are detected through the checksum of every file +In the `local` mode, changes are detected through the hash value of every file listed in every DVC-file in question against the corresponding file in the file system. The command output indicates the detected changes, if any. If no differences are detected, `dvc status` prints this message: @@ -56,11 +56,11 @@ be executed by `dvc repro`. If instead, differences are detected, `dvc status` lists those changes. For each DVC-file (stage) with differences, the changes in dependencies and/or outputs that differ are listed. For each item listed, either -the file name or the checksum is shown, and additionally a status word is shown +the file name or hash is shown, and additionally a status word is shown describing the changes (described below). -- _changed checksum_ means that the DVC-file checksum has changed - (e.g. someone manually edited the file). +- _changed checksum_ means that the DVC-file hash has changed (e.g. + someone manually edited the file). - _always changed_ means that this is a DVC-file with no dependencies (an _orphan_ stage file) or that it has the `always_changed: true` value set (see @@ -72,16 +72,15 @@ describing the changes (described below). commands like `dvc commit` or `dvc repro`, `dvc run` should be run to update the file. Possible states are: - - _new_: Output exists in workspace, but there is no - corresponding checksum calculated and saved in the DVC-file for this output - yet. - - _modified_: Output or dependency exists in workspace, but the corresponding - checksum in the DVC-file is not up to date. - - _deleted_: Output or dependency does not exist in workspace, but still - referred in the DVC-file. - - _not in cache_: Output exists in workspace and the corresponding checksum in - the DVC-file is up to date, but there is no corresponding cache - entry. 
+ - _new_: An output is found in the workspace, but there is no + corresponding file hash saved in a DVC-file yet. + - _modified_: An output or dependency is found in the workspace, + but the corresponding file hash in the DVC-file is not up to date. + - _deleted_: The output or dependency is referenced in a DVC-file, but does + not exist in the workspace. + - _not in cache_: An output exists in the workspace and the corresponding file + hash in the DVC-file is up to date, but there is no corresponding + cache entry. **For comparison against remote storage:** diff --git a/public/static/docs/command-reference/update.md b/public/static/docs/command-reference/update.md index c4c81f1558..630c342867 100644 --- a/public/static/docs/command-reference/update.md +++ b/public/static/docs/command-reference/update.md @@ -27,7 +27,7 @@ Note that import stages are considered always locked, meaning that if you run update them. `dvc update` will not have an effect on import stages that are fixed to a commit -SHA hash (`rev` field in the DVC-file). Please refer to +hash (`rev` field in the DVC-file). Please refer to [Fixed revisions & re-importing](/doc/command-reference/import#example-fixed-revisions-re-importing) for more details. 
diff --git a/public/static/docs/command-reference/version.md b/public/static/docs/command-reference/version.md index 7d6f0c2221..0e4d94c299 100644 --- a/public/static/docs/command-reference/version.md +++ b/public/static/docs/command-reference/version.md @@ -16,8 +16,8 @@ system/environment: | Line | Detail | | ------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit SHA hash in case of a development version) | -| `Python version` | Version of Python being used in the environment where DVC is initialized | +| [`DVC version`](#components-of-dvc-version) | Version of DVC (along with a Git commit hash in case of a development version) | +| `Python version` | Version of Python used in the environment where DVC is initialized | | `Platform` | Information about the operating system of the machine | | [`Binary`](#what-we-mean-by-binary) | Shows whether DVC was installed from a package or from a binary release | | `Package manager` | Name of the package manager used to install DVC if any (`pip`, `conda`, etc) | @@ -53,10 +53,10 @@ The detail of DVC version depends upon the way of installing DVC. that might not be ready to publish yet. Therefore installing using the above command might have issues regarding its usage. So to trace any error reported with this setup, we need to know exactly which version is being used. For this - we rely on a Git commit SHA hash, that is displayed in this command's output - like this: `0.40.2+292cab.mod`. The part before `+` is the `_BASE_VERSION`, - and the following part is the SHA of the tip of the `master` branch. The - optional suffix `.mod` means that code is modified. + we rely on a Git commit hash, that is displayed in this command's output like + this: `0.40.2+292cab.mod`. 
The part before `+` is the `_BASE_VERSION`, and the + following part is the SHA of the tip of the `master` branch. The optional + suffix `.mod` means that code is modified. ### What we mean by "Binary" diff --git a/public/static/docs/install/pre-release.md b/public/static/docs/install/pre-release.md index 887011b0c5..1f31b1353a 100644 --- a/public/static/docs/install/pre-release.md +++ b/public/static/docs/install/pre-release.md @@ -15,9 +15,9 @@ $ pip install git+https://github.com/iterative/dvc ``` > `gitpython` allows the installation process to generate a DVC version using -> the current Git commit SHA hash. This lets us to distinguish official DVC -> releases (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). -> For more information on our versioning convention, refer to +> the current Git commit hash. This lets us to distinguish official DVC releases +> (e.g. `0.64.3`) from a development version (e.g. `0.64.3-9c7381`). For more +> information on our versioning convention, refer to > [Components of DVC version](/doc/command-reference/version#components-of-dvc-version). To install a development version for contributing to the project, please refer diff --git a/public/static/docs/tutorials/deep/define-ml-pipeline.md b/public/static/docs/tutorials/deep/define-ml-pipeline.md index 198d0fb5dc..5899270ce6 100644 --- a/public/static/docs/tutorials/deep/define-ml-pipeline.md +++ b/public/static/docs/tutorials/deep/define-ml-pipeline.md @@ -69,7 +69,7 @@ need to run `dvc unprotect` or `dvc remove` first (see the If you take a look at the [DVC-file](/doc/user-guide/dvc-file-format) created by `dvc add`, you will see that outputs are tracked in the `outs` field. In this file, only one output is specified. The output contains the data -file path in the repository and its MD5 checksum. This checksum determines the +file path in the repository and its MD5 hash. 
This hash value determines the location of the actual content file in the [cache directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory), `.dvc/cache`. @@ -224,8 +224,8 @@ outs: Sections of the file above include: - `cmd`: The command to run -- `deps`: Dependencies with MD5 checksums -- `outs`: Outputs with MD5 checksums +- `deps`: Dependencies with MD5 hashes +- `outs`: Outputs with MD5 hashes And (as with the `dvc add` command) the `data/.gitignore` file was modified. Now it includes the unarchived command output file `Posts.xml`. @@ -242,7 +242,7 @@ Posts.xml The output file `Posts.xml` was transformed by DVC into a data file in accordance with the `-o` option. You can find the corresponding cache file with -the checksum, with a path starting in `c1/fa36d` as we can see below: +the hash value, as a path starting in `c1/fa36d`: ```dvc $ ls .dvc/cache/ diff --git a/public/static/docs/tutorials/deep/reproducibility.md b/public/static/docs/tutorials/deep/reproducibility.md index 9cb3997fa8..2194bc6cb0 100644 --- a/public/static/docs/tutorials/deep/reproducibility.md +++ b/public/static/docs/tutorials/deep/reproducibility.md @@ -116,7 +116,7 @@ master: Let's keep the result in the repository. Later we can find out why bigrams don't add value to the current model and change that. -Many DVC-files were changed. This happened due to file checksum changes. +Many DVC-files were changed. This happened due to file hash changes. ```dvc $ git status -s @@ -232,9 +232,9 @@ CONFLICT (content): Merge conflict in Dvcfile Automatic merge failed; fix conflicts and then commit the result. ``` -The merge has a few conflicts. All of the conflicts are related to file checksum +The merge has a few conflicts. All of the conflicts are related to file hash mismatches in the branches. 
You can properly merge conflicts by prioritizing the -checksums from the bigrams branch: that is, by removing all checksums of the +file hashes from the bigrams branch: that is, by removing all hashes of the other branch. [Here](https://help.github.com/en/articles/resolving-a-merge-conflict-using-the-command-line) you can find a tutorial that clarifies how to do that. It is also important to @@ -244,15 +244,15 @@ remove all automatically generated =======, >>>>>>>) from `model.p.dvc` and `Dvcfile`. -Another way to solve git merge conflicts is to simply replace all checksums with -empty strings ''. The only disadvantage of this trick is that DVC will need to -recompute the outputs checksums. +Another way to solve git merge conflicts is to simply replace all file hashes +with empty strings ''. The only disadvantage of this trick is that DVC will need +to recompute the output hashes. After resolving the conflicts you need to checkout a proper version of the data files: ```dvc -# Replace conflicting checksums to empty string '' +# Replace conflicting hashes with empty string '' $ vi model.p.dvc $ vi Dvcfile $ dvc checkout diff --git a/public/static/docs/tutorials/pipelines.md b/public/static/docs/tutorials/pipelines.md index b1ffb0bcaf..a6078cce1b 100644 --- a/public/static/docs/tutorials/pipelines.md +++ b/public/static/docs/tutorials/pipelines.md @@ -183,8 +183,8 @@ outs: ``` Just like the DVC-file we created earlier with `dvc add`, this stage file uses -checksums that point to the cache, to describe and version control dependencies -and outputs. Output `data/Posts.xml` file is saved as +`md5` hashes (that point to the cache) to describe and version control +dependencies and outputs. Output `data/Posts.xml` file is saved as `.dvc/cache/a3/04afb96060aad90176268345e10355` and linked (or copied) to the workspace, as well as added to `.gitignore`. 
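As a rough illustration of how an `md5` value in a stage file relates to a cache path such as `.dvc/cache/a3/04afb96060aad90176268345e10355`, here is a minimal sketch. It assumes the layout described in the cache structure guide (first two hash characters name the directory, the rest name the file). The helper name `cache_path` is hypothetical and is not part of DVC's API.

```python
from pathlib import PurePosixPath


def cache_path(md5_hex: str, cache_dir: str = ".dvc/cache") -> str:
    """Map a 32-character MD5 hex digest to a cache path (hypothetical helper)."""
    if len(md5_hex) != 32:
        raise ValueError("expected a 32-character MD5 hex digest")
    # The first two characters name the directory, the remaining 30 the file.
    return str(PurePosixPath(cache_dir) / md5_hex[:2] / md5_hex[2:])


# The hash below is the one implied by the cache path quoted in the tutorial.
print(cache_path("a304afb96060aad90176268345e10355"))
# .dvc/cache/a3/04afb96060aad90176268345e10355
```

Going the other way (from a cache path back to the `md5` field) is just the concatenation of the directory name and the file name.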
diff --git a/public/static/docs/tutorials/versioning.md b/public/static/docs/tutorials/versioning.md index c93c6c6de3..a649c89d59 100644 --- a/public/static/docs/tutorials/versioning.md +++ b/public/static/docs/tutorials/versioning.md @@ -312,7 +312,7 @@ When you have a script that takes some data as an input and produces other data outputs, a better way to capture them is to use `dvc run`: > If you tried the commands in the -> [Switching between data or model versions](#switching-between-data-or-model-versions) +> [Switching between data and/or model versions](#switching-between-data-and-or-model-versions) > section, go back to the master branch code and data with: > > ```dvc diff --git a/public/static/docs/understanding-dvc/related-technologies.md b/public/static/docs/understanding-dvc/related-technologies.md index 185580d9dc..b03e79fbf6 100644 --- a/public/static/docs/understanding-dvc/related-technologies.md +++ b/public/static/docs/understanding-dvc/related-technologies.md @@ -74,13 +74,13 @@ Luigi, etc. - File tracking: - - DVC tracks files based on their checksum (MD5) instead of file timestamps. + - DVC tracks files based on their hashes (MD5) instead of file timestamps. This helps avoid running into heavy processes like model retraining when you checkout a previously trained version of a model (Make would retrain the model). - DVC uses file timestamps and inodes for optimization. This allows DVC to - avoid recomputing all dependency files' checksums, which would be highly + avoid recomputing all dependency file hashes, which would be highly problematic when working with large files (10 GB+). ### Git-annex @@ -92,7 +92,7 @@ Luigi, etc. - DVC can use reflinks\* or hardlinks (depending on the system) instead of symlinks to improve performance and the user experience. -- DVC optimizes checksum calculation. +- DVC optimizes file hash calculation. 
- Git-annex is a datafile-centric system whereas DVC is focused on providing a workflow for machine learning and reproducible experiments. When a DVC or diff --git a/public/static/docs/understanding-dvc/what-is-dvc.md b/public/static/docs/understanding-dvc/what-is-dvc.md index f2fd0533ed..444d7a6774 100644 --- a/public/static/docs/understanding-dvc/what-is-dvc.md +++ b/public/static/docs/understanding-dvc/what-is-dvc.md @@ -19,14 +19,14 @@ branch or commit. DVC uses a few core concepts: - **Experiment**: Equivalent to a - [Git-revision](https://git-scm.com/docs/revisions). Each experiment (extract + [Git revision](https://git-scm.com/docs/revisions). Each experiment (extract new features, change model hyperparameters, data cleaning, add a new data source) should be performed in a separate branch or tag. DVC allows experiments to be integrated into a Git repository history and never needs to recompute the results after a successful merge. - **Experiment state** or state: Equivalent to a Git snapshot (all committed - files). A Git commit SHA hash, branch or tag name, etc. can be used as a + files). A Git commit hash, branch or tag name, etc. can be used as a [reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an experiment state. diff --git a/public/static/docs/use-cases/versioning-data-and-model-files.md b/public/static/docs/use-cases/versioning-data-and-model-files.md index 5b94f215ee..68e76f68f4 100644 --- a/public/static/docs/use-cases/versioning-data-and-model-files.md +++ b/public/static/docs/use-cases/versioning-data-and-model-files.md @@ -86,7 +86,7 @@ file. Let's consider the full checkout first. It's quite straightforward: > `v1.0` below is a Git tag that should be created in advance to identify the > dataset version you are interested in. Any > [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -> (for example `HEAD^` or a commit SHA hash) can be used instead. 
+> (for example `HEAD^` or a commit hash) can be used instead. ```dvc $ git checkout v1.0 diff --git a/public/static/docs/user-guide/analytics.md b/public/static/docs/user-guide/analytics.md index 46946f0885..7315559cea 100644 --- a/public/static/docs/user-guide/analytics.md +++ b/public/static/docs/user-guide/analytics.md @@ -27,7 +27,7 @@ DVC's analytics record the following information per event: - The DVC version e.g. `0.82.0` - Whether DVC was installed from a binary release -- Operating system information, e.g. Ubuntu 14.04 +- Operating system information, e.g. Ubuntu Linux 14.04 - Whether the project uses Git - Command type e.g. `CmdDataPull` - Command return code e.g. `1` diff --git a/public/static/docs/user-guide/dvc-file-format.md b/public/static/docs/user-guide/dvc-file-format.md index 3799677253..93b7d661ec 100644 --- a/public/static/docs/user-guide/dvc-file-format.md +++ b/public/static/docs/user-guide/dvc-file-format.md @@ -49,7 +49,7 @@ On the top level, `.dvc` file consists of these fields: - `cmd`: Executable command defined in this stage - `deps`: List of dependencies for this stage - `outs`: List of outputs for this stage -- `md5`: md5 checksum for this DVC-file +- `md5`: MD5 hash for this DVC-file - `locked`: Whether or not this stage is locked from reproduction - `wdir`: Directory to run command in (default `.`) - `always_changed`: Whether or not this stage should always be considered as @@ -58,8 +58,7 @@ On the top level, `.dvc` file consists of these fields: A dependency entry consists of a pair of fields: - `path`: Path to the dependency, relative to the `wdir` path (always present) -- `md5`: md5 checksum for the dependency (most - [stages](/doc/command-reference/run)) +- `md5`: MD5 hash for the dependency (most [stages](/doc/command-reference/run)) - `etag`: Strong ETag response header (only HTTP external dependencies created with `dvc import-url`) - `repo`: This entry is only for external dependencies created with @@ -67,11 +66,11 @@ A 
dependency entry consists of a pair of fields: - `url`: URL of Git repository with source DVC project - `rev`: Only present when the `--rev` option of `dvc import` is used. - Specific commit SHA hash, branch or tag name, etc. (a + Specific commit hash, branch or tag name, etc. (a [Git revision](https://git-scm.com/docs/revisions)) used to import the dependency from. - - `rev_lock`: Git commit SHA hash of the external DVC repository - at the time of importing or updating (with `dvc update`) the dependency. + - `rev_lock`: Git commit hash of the external DVC repository at + the time of importing or updating (with `dvc update`) the dependency. > See the examples in > [External Dependencies](/doc/user-guide/external-dependencies) for more @@ -80,7 +79,7 @@ A dependency entry consists of a pair of fields: An output entry consists of these fields: - `path`: Path to the output, relative to the `wdir` path -- `md5`: md5 checksum for the output +- `md5`: MD5 hash for the output - `cache`: Whether or not dvc should cache the output - `metric`: Whether or not this file is a [metric](/doc/command-reference/metrics) file diff --git a/public/static/docs/user-guide/dvc-files-and-directories.md b/public/static/docs/user-guide/dvc-files-and-directories.md index c5ae6521a8..8ecbcfd651 100644 --- a/public/static/docs/user-guide/dvc-files-and-directories.md +++ b/public/static/docs/user-guide/dvc-files-and-directories.md @@ -25,8 +25,8 @@ operation: > needed to download or reproduce them. - `.dvc/state`: This file is used for optimization. It is a SQLite database, - that contains checksums for files tracked in a DVC project, with respective - timestamps and inodes to avoid unnecessary checksum computations. It also + that contains hash values for files tracked in a DVC project, with respective + timestamps and inodes to avoid unnecessary file hash computations. 
It also contains a list of links (from cache to workspace) created by DVC and is used to cleanup your workspace when calling `dvc checkout`. @@ -52,17 +52,17 @@ operation: There are two ways in which the data is stored in cache: As a single file (eg. `data.csv`), or a directory of files. -For the first case, we calculate the file's checksum, a 32 characters long -string (usually MD5). The first two characters are used to name the directory -inside `.dvc/cache`, and the rest become the file name of the cached file. For -example, if a data file `Posts.xml.zip` has checksum -`ec1d2935f811b77cc49b031b999cbf17`, its cache entry will be -`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17` locally. +For the first case, we calculate the file hash, a 32-character string +(usually MD5). The first two characters are used to name the directory inside +`.dvc/cache`, and the rest become the file name of the cached file. For example, +if a data file `Posts.xml.zip` has a hash value of +`ec1d2935f811b77cc49b031b999cbf17`, its local cache entry will be +`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`. -> Note that file checksums are calculated from file contents only. 2 or more -> files with different names but the same contents can exist in the workspace -> and be tracked by DVC, but only one copy is stored in the cache. This helps -> avoid data duplication in cache and remotes. +> Note that file hashes are calculated from file contents only. Two or more files +> with different names but the same contents can exist in the workspace and be +> tracked by DVC, but only one copy is stored in the cache. This helps avoid +> data duplication in cache and remotes. For the second case, let us consider a directory with 2 images.
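For the single-file case, the note about hashes depending on contents only can be sketched as follows. This is an illustration under assumptions, not DVC's implementation: it hashes the raw bytes with MD5, whereas DVC's own hashing may treat some files differently. The function name `file_md5` and the file names are made up for the example.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the MD5 hex digest of a file's raw bytes (hypothetical helper)."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "data.csv"
    second = Path(tmp) / "copy-of-data.csv"
    first.write_bytes(b"id,value\n1,2\n")
    second.write_bytes(b"id,value\n1,2\n")

    # Different names, same contents: both map to the same cache entry.
    h1, h2 = file_md5(first), file_md5(second)
    print(h1 == h2)
    print(f".dvc/cache/{h1[:2]}/{h1[2:]}")
```

Because the file name never enters the hash, both files would share one cache entry, which is the deduplication the note describes.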
@@ -77,8 +77,8 @@ $ dvc add data/images ``` When running `dvc add` on this directory of images, a `data/images.dvc` -[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the checksum -of the directory: +[DVC-file](/doc/user-guide/dvc-file-format) is created, containing the hash +value of the directory: ```yaml md5: 77e511dafe2178d936e54331d5d6288f @@ -104,7 +104,7 @@ $ tree .dvc/cache The cache file with `.dir` extension is a special text file that contains the mapping of files in the `data/` directory (as a JSON array), along with their -checksums. The other two cache files are the files inside `data/`. +hash values. The other two cache files are the files inside `data/`. A typical `.dir` cache file looks like this: diff --git a/public/static/docs/user-guide/managing-external-data.md b/public/static/docs/user-guide/managing-external-data.md index c508576b4f..d396148a0f 100644 --- a/public/static/docs/user-guide/managing-external-data.md +++ b/public/static/docs/user-guide/managing-external-data.md @@ -37,9 +37,8 @@ Non-cached external outputs (`-O`) do not require an external cache to be setup. > Avoid using the same remote location that you are using for `dvc push`, > `dvc pull`, `dvc fetch` as external cache for your external outputs, because -> it may cause possible checksum overlaps. Checksum for some data file on an -> external storage can potentially collide with checksum generated locally for a -> different file, with a different content. +> it may cause file hash overlaps: the hash value of a data file in +> external storage could collide with that generated locally for another file. ## Examples