diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md
index eb3782db30..1fe2092183 100644
--- a/content/docs/start/data-and-model-access.md
+++ b/content/docs/start/data-and-model-access.md
@@ -4,27 +4,27 @@ title: 'Get Started: Data and Model Access'
# Get Started: Data and Model Access
-Okay, we've learned how to _track_ data and models with DVC, and how to commit
-their versions to Git. The next questions are: How can we _use_ these artifacts
-outside of the project? How do I download a model to deploy it? How to download
+We've learned how to _track_ data and models with DVC, and how to commit their
+versions to Git. The next questions are: How can we _use_ these artifacts
+outside of the project? How do we download a model to deploy it? Or download
a specific version of a model? Or reuse datasets across different projects?
> These questions tend to come up when you browse the files that DVC saves to
-> remote storage, e.g.
+> remote storage (e.g.
> `s3://dvc-public/remote/get-started/fb/89904ef053f04d64eafcc3d70db673` 😱
-> instead of the original files, name such as `model.pkl` or `data.xml`.
+> instead of the original file names, such as `model.pkl` or `data.xml`).
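
To make the hashed path above less mysterious, here is a minimal Python sketch of how such a name can be derived from a file's contents. It assumes the layout visible in the example path (an MD5 hex digest split into a 2-character directory and a 30-character file name). The helper name `cache_relpath` is hypothetical and not part of DVC's API.

```python
import hashlib


def cache_relpath(content: bytes) -> str:
    """Return a content-addressed relative path for `content`.

    Hypothetical helper: it mirrors the `fb/89904e...` shape of the example
    path (2-char directory + 30-char file name); it is not DVC's own code.
    """
    digest = hashlib.md5(content).hexdigest()  # 32 hex characters
    return f"{digest[:2]}/{digest[2:]}"


# Identical contents always map to the same path, whatever the file is called.
print(cache_relpath(b"hello"))
```

Because the name depends only on the contents, the original file name has to be recorded somewhere else, which is the role of the `.dvc` files.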
Read on or watch our video to see how to find and access models and datasets
with DVC.
https://youtu.be/EE7Gk84OZY8
-Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`
-that we'll cover later), have their history in Git, DVC remote storage config
-saved in Git contain all the information needed to access and download any
-version of datasets, files, and models. It means that a Git repository with
-DVC files becomes an entry point, and can be used instead of
-accessing files directly.
+Remember those `.dvc` files `dvc add` generates? Those files (and `dvc.lock`,
+which we'll cover later) have their history in Git. Together with DVC's remote
+storage config, which is also saved in Git, they contain all the information
+needed to access and download any version of datasets, files, and models. This
+means that a Git repository with DVC files becomes an entry point, and can be
+used instead of accessing files directly.
## Find a file or directory
@@ -62,7 +62,7 @@ the data came from or whether new versions are available.
## Import file or directory
`dvc import` also downloads any file or directory, while creating a `.dvc`
-file that can be saved in the project:
+file (which can be saved in the project):
```dvc
$ dvc import https://github.com/iterative/dataset-registry \
@@ -71,7 +71,7 @@ $ dvc import https://github.com/iterative/dataset-registry \
This is similar to `dvc get` + `dvc add`, but the resulting `.dvc` file
includes metadata to track changes in the source repository. This allows you to
-bring in changes from the data source later, using `dvc update`.
+bring in changes from the data source later using `dvc update`.
@@ -83,7 +83,7 @@ bring in changes from the data source later, using `dvc update`.
> `dvc import` downloads from [remote storage](/doc/command-reference/remote).
`.dvc` files created by `dvc import` have special fields, such as the data
-source `repo`, and `path` (under `deps`):
+source `repo` and `path` (under `deps`):
```git
+deps:
@@ -111,8 +111,8 @@ directly from within an application at runtime. For example:
import dvc.api
with dvc.api.open(
- 'get-started/data.xml',
- repo='https://github.com/iterative/dataset-registry'
- ) as fd:
- # ... fd is a file descriptor that can be processed normally.
+    'get-started/data.xml',
+    repo='https://github.com/iterative/dataset-registry'
+) as fd:
+    # fd is a file object that can be processed normally
+    contents = fd.read()
```
diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md
index 9f61fc1e05..67f7eea98f 100644
--- a/content/docs/start/data-and-model-versioning.md
+++ b/content/docs/start/data-and-model-versioning.md
@@ -13,9 +13,9 @@ and seeing data files and machine learning models in the workspace. Or switching
to a different version of a 100Gb file in less than a second with a
`git checkout`.
-The foundation of DVC consists of a few commands that you can run along with
-`git` to track large files, directories, or ML model files. Think "Git for
-data". Read on or watch our video to learn about versioning data with DVC!
+The foundation of DVC consists of a few commands you can run along with `git` to
+track large files, directories, or ML model files. Think "Git for data". Read on
+or watch our video to learn about versioning data with DVC!
https://youtu.be/kLKBcPonMYw
@@ -25,16 +25,16 @@ To start tracking a file or directory, use `dvc add`:
### ⚙️ Expand to get an example dataset.
-Having initialized a project in the previous section, get the data file we will
-be using later like this:
+Having initialized a project in the previous section, we can get the data file
+(which we'll be using later) like this:
```dvc
$ dvc get https://github.com/iterative/dataset-registry \
get-started/data.xml -o data/data.xml
```
-We use the fancy `dvc get` command to jump ahead a bit and show how Git repo
-becomes a source for datasets or models - what we call "data/model registry".
+We use the fancy `dvc get` command to jump ahead a bit and show how a Git repo
+becomes a source for datasets or models – what we call a "data/model registry".
`dvc get` can download any file or directory tracked in a DVC
repository. It's like `wget`, but for DVC or Git repos. In this case we
download the latest version of the `data.xml` file from the
@@ -48,22 +48,24 @@ $ dvc add data/data.xml
```
DVC stores information about the added file (or a directory) in a special `.dvc`
-file named `data/data.xml.dvc`, a small text file with a human-readable
-[format](/doc/user-guide/project-structure/dvc-files). This file can be easily
-versioned like source code with Git, as a placeholder for the original data
-(which gets listed in `.gitignore`):
+file named `data/data.xml.dvc` – a small text file with a human-readable
+[format](/doc/user-guide/project-structure/dvc-files). This metadata file is a
+placeholder for the original data, and can be easily versioned like source code
+with Git:
```dvc
$ git add data/data.xml.dvc data/.gitignore
$ git commit -m "Add raw data"
```
+The original data, meanwhile, is listed in `.gitignore`.
+
### 💡 Expand to see what happens under the hood.
-`dvc add` moved the data to the project's cache, and linked\* it
-back to the workspace.
+`dvc add` moved the data to the project's cache, and
+linked it back to the workspace.
```dvc
$ tree .dvc/cache
@@ -82,10 +84,6 @@ outs:
path: data.xml
```
-> \* See
-> [Large Dataset Optimization](/doc/user-guide/large-dataset-optimization) and
-> `dvc config cache` for more info. on file linking.
-
## Storing and sharing
@@ -93,7 +91,7 @@ outs:
You can upload DVC-tracked data or model files with `dvc push`, so they're
safely stored [remotely](/doc/command-reference/remote). This also means they
can be retrieved on other environments later with `dvc pull`. First, we need to
-setup a storage:
+set up a remote storage location:
```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore
@@ -101,16 +99,16 @@ $ git add .dvc/config
$ git commit -m "Configure remote storage"
```
-> DVC supports the following remote storage types: Google Drive, Amazon S3,
-> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
-> Please refer to `dvc remote add` for more details and examples.
+> DVC supports many remote storage types, including Amazon S3, SSH, Google
+> Drive, Azure Blob Storage, and HDFS. See `dvc remote add` for more details and
+> examples.
-### ⚙️ Set up a remote storage
+### ⚙️ Expand to set up remote storage.
DVC remotes let you store a copy of the data tracked by DVC outside of the local
-cache, usually a cloud storage service. For simplicity, let's set up a _local
+cache (usually a cloud storage service). For simplicity, let's set up a _local
remote_:
```dvc
@@ -121,7 +119,7 @@ $ git commit .dvc/config -m "Configure local remote"
> While the term "local remote" may seem contradictory, it doesn't have to be.
> The "local" part refers to the type of location: another directory in the file
-> system. "Remote" is how we call storage for DVC projects. It's
+> system. "Remote" is what we call storage for DVC projects. It's
> essentially a local data backup.
@@ -160,7 +158,7 @@ run it after `git clone` and `git pull`.
-### ⚙️ Expand to explode the project 💣
+### ⚙️ Expand to delete locally cached data.
If you've run `dvc push`, you can delete the cache (`.dvc/cache`) and
`data/data.xml` to experiment with `dvc pull`:
@@ -189,8 +187,8 @@ latest version:
### ⚙️ Expand to make some changes.
-For the sake of simplicity let's just double the dataset artificially (and
-pretend that we got more data from some external source):
+Let's say we obtained more data from some external source. We can pretend this
+is the case by doubling the dataset:
```dvc
$ cp data/data.xml /tmp/data.xml
@@ -212,9 +210,8 @@ $ dvc push
## Switching between versions
-The regular workflow is to use `git checkout` first to switch a branch, checkout
-a commit, or a revision of a `.dvc` file, and then run `dvc checkout` to sync
-data:
+The regular workflow is to use `git checkout` first (to switch a branch or
+check out a `.dvc` file version) and then run `dvc checkout` to sync data:
```dvc
$ git checkout <...>
@@ -225,15 +222,15 @@ $ dvc checkout
### ⚙️ Expand to get the previous version of the dataset.
-Let's cleanup the previous artificial changes we made and get the previous :
+Let's go back to the original version of the data:
```dvc
-$ git checkout HEAD^1 data/data.xml.dvc
+$ git checkout HEAD~1 data/data.xml.dvc
$ dvc checkout
```
-Let's commit it (no need to do `dvc push` this time since the previous version
-of this dataset was saved before):
+Let's commit it (no need to do `dvc push` this time since this original version
+of the dataset was already saved):
```dvc
$ git commit data/data.xml.dvc -m "Revert dataset updates"
@@ -241,8 +238,8 @@ $ git commit data/data.xml.dvc -m "Revert dataset updates"
-Yes, DVC is technically not even a version control system! `.dvc` files content
-defines data file versions. Git itself provides the version control. DVC in turn
+Yes, DVC is technically not even a version control system! `.dvc` file contents
+define data file versions. Git itself provides the version control. DVC in turn
creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in
the workspace efficiently to match them.
@@ -250,16 +247,13 @@ the workspace efficiently to match them.
In cases where you process very large datasets, you need an efficient mechanism
(in terms of space and performance) to share a lot of data, including different
-versions of itself. Do you use a network attached storage? Or a large external
-volume?
-
-While these cases are not covered in the Get Started, we recommend reading the
-following sections next to learn more about advanced workflows:
+versions. Do you use network-attached storage (NAS)? Or a large external volume?
+You can learn more about advanced workflows using these links:
- A shared [external cache](/doc/use-cases/shared-development-server) can be set
up to store, version and access a lot of data on a large shared volume
efficiently.
- A quite advanced scenario is to track and version data directly on the remote
- storage (e.g. S3). Check out
+ storage (e.g. S3). See
[Managing External Data](https://dvc.org/doc/user-guide/managing-external-data)
to learn more.
diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md
index 6923dc58b7..7849a6cb48 100644
--- a/content/docs/start/data-pipelines.md
+++ b/content/docs/start/data-pipelines.md
@@ -24,8 +24,8 @@ https://youtu.be/71IGzyH95UY
## Pipeline stages
Use `dvc run` to create _stages_. These represent processes (source code tracked
-with Git) that form the steps of a pipeline. Stages also connect code to its
-data input and output. Let's transform a Python script into a
+with Git) which form the steps of a _pipeline_. Stages also connect code to its
+corresponding data _input_ and _output_. Let's transform a Python script into a
[stage](/doc/command-reference/run):
@@ -84,7 +84,7 @@ The command options used above mean the following:
- `-n prepare` specifies a name for the stage. If you open the `dvc.yaml` file
you will see a section named `prepare`.
-- `-p prepare.seed,prepare.split` defines special types of dependencies -
+- `-p prepare.seed,prepare.split` defines special types of dependencies –
[parameters](/doc/command-reference/params). We'll get to them later in the
[Metrics, Parameters, and Plots](/doc/start/metrics-parameters-plots) page,
but the idea is that the stage can depend on field values from a parameters
@@ -119,8 +119,8 @@ prepare:
βββ ...
```
-- The last line, `python src/prepare.py ...`, is the command to run in this
- stage, and it's saved to `dvc.yaml`, as shown below.
+- The last line, `python src/prepare.py data/data.xml`, is the command to run in
+  this stage, and it's saved to `dvc.yaml`, as shown below.
The resulting `prepare` stage contains all of the information above:
@@ -150,7 +150,7 @@ in this case); `dvc run` already took care of this. You only need to run
By using `dvc run` multiple times, and specifying outputs of a
stage as dependencies of another one, we can describe a sequence of
-commands that gets to a desired result. This is what we call a _data pipeline_
+commands which gets to a desired result. This is what we call a _data pipeline_
or [_dependency graph_](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
Let's create a second stage chained to the outputs of `prepare`, to perform
@@ -202,8 +202,8 @@ The changes to the `dvc.yaml` should look like this:
### ⚙️ Expand to add more stages.
-Let's add the training itself. Nothing new this time, the same `dvc run` command
-with the same set of options:
+Let's add the training itself. Nothing new this time; just the same `dvc run`
+command with the same set of options:
```dvc
$ dvc run -n train \
@@ -217,13 +217,13 @@ Please check the `dvc.yaml` again, it should have one more stage now.
-This should be a good point to commit the changes with Git. These include
+This should be a good time to commit the changes with Git. These include
`.gitignore`, `dvc.lock`, and `dvc.yaml` – which describe our pipeline.
## Reproduce
-The whole point of creating this `dvc.yaml` pipelines file is an ability to
-reproduce the pipeline:
+The whole point of creating this `dvc.yaml` file is the ability to easily
+reproduce a pipeline:
```dvc
$ dvc repro
@@ -231,16 +231,15 @@ $ dvc repro
-### ⚙️ Expand to have some fun with it
+### ⚙️ Expand to have some fun with it.
Let's try to play a little bit with it. First, let's try to change one of the
parameters for the training stage:
-```dvc
-$ vim params.yaml
-```
+1. Open `params.yaml` and change `n_est` to `100`, and
+2. (re)run `dvc repro`.
-Change `n_est` to `100` and run `dvc repro`, you should see:
+You should see:
```dvc
$ dvc repro
@@ -260,9 +259,9 @@ Stage 'prepare' didn't change, skipping
Stage 'featurize' didn't change, skipping
```
-Same as before, no need to run `prepare`, `featurize`, etc ... but, it doesn't
-run even `train` again this time either! It cached the previous run with the
-same set of inputs (parameters + data) and reused it.
+As before, there was no need to rerun `prepare`, `featurize`, etc. But this time
+it didn't rerun `train` either! The previous run with the same set of inputs
+(parameters and data) was saved in DVC's run-cache, and reused here.
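The idea behind this reuse can be sketched in a few lines of Python. This is only an illustration under assumptions, not DVC's actual implementation: the `run_key` and `run_stage` names, the in-memory dictionary standing in for the run-cache, and the sample command and hash values are all hypothetical.

```python
import hashlib
import json

run_cache = {}  # hypothetical stand-in for DVC's run-cache


def run_key(cmd, params, dep_hashes):
    """Hash everything that determines a stage's result."""
    payload = json.dumps(
        {"cmd": cmd, "params": params, "deps": dep_hashes}, sort_keys=True
    )
    return hashlib.md5(payload.encode()).hexdigest()


def run_stage(cmd, params, dep_hashes, execute):
    """Call `execute` only if this exact combination was never seen before."""
    key = run_key(cmd, params, dep_hashes)
    if key in run_cache:
        return run_cache[key], "reused"
    result = execute()
    run_cache[key] = result
    return result, "executed"


def train():
    """Placeholder for the real (expensive) training command."""
    return "model.pkl"


deps = {"data/features": "hypothetical-hash"}  # assumed dependency hash
cmd = "python src/train.py"  # assumed command, for illustration only

print(run_stage(cmd, {"n_est": 50}, deps, train)[1])   # executed
print(run_stage(cmd, {"n_est": 100}, deps, train)[1])  # executed
print(run_stage(cmd, {"n_est": 50}, deps, train)[1])   # reused
```

Switching a parameter back to a value that was already tried produces the same key, so the stored result is returned without running the command again.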
@@ -270,12 +269,12 @@ same set of inputs (parameters + data) and reused it.
### 💡 Expand to see what happens under the hood.
-`dvc repro` relies on the DAG definition that it reads from `dvc.yaml`, and uses
+`dvc repro` relies on the DAG definition from `dvc.yaml`, and uses
`dvc.lock` to determine what exactly needs to be run.
-`dvc.lock` file is similar to `.dvc` files and captures hashes (in most cases
-`md5`s) of the dependencies, values of the parameters that were used, it can be
-considered a _state_ of the pipeline:
+The `dvc.lock` file is similar to a `.dvc` file – it captures hashes (in most
+cases `md5`s) of the dependencies and values of the parameters that were used.
+It can be considered a _state_ of the pipeline:
```yaml
schema: '2.0'
@@ -304,23 +303,23 @@ stages:
DVC pipelines (`dvc.yaml` file, `dvc run`, and `dvc repro` commands) solve a few
important problems:
-- _Automation_ - run a sequence of steps in a "smart" way that makes iterating
+- _Automation_: run a sequence of steps in a "smart" way which makes iterating
on your project faster. DVC automatically determines which parts of a project
- need to be run, and it caches "runs" and their results, to avoid unnecessary
- re-runs.
-- _Reproducibility_ - `dvc.yaml` and `dvc.lock` files describe what data to use
+ need to be run, and it caches "runs" and their results to avoid unnecessary
+ reruns.
+- _Reproducibility_: `dvc.yaml` and `dvc.lock` files describe what data to use
and which commands will generate the pipeline results (such as an ML model).
Storing these files in Git makes it easy to version and share.
-- _Continuous Delivery and Continuous Integration (CI/CD) for ML_ - describing
- projects in way that it can be reproduced (built) is the first necessary step
- before introducing CI/CD systems. See our sister project,
+- _Continuous Delivery and Continuous Integration (CI/CD) for ML_: describing
+  projects in a way that lets them be reproduced (built) is the first necessary
+  step before introducing CI/CD systems. See our sister project
[CML](https://cml.dev/) for some examples.
## Visualize
Having built our pipeline, we need a good way to understand its structure.
-Seeing a graph of connected stages would help. DVC lets you do just that,
-without leaving the terminal!
+Seeing a graph of connected stages would help. DVC lets you do so without
+leaving the terminal!
```dvc
$ dvc dag
diff --git a/content/docs/start/experiments.md b/content/docs/start/experiments.md
index 84831ccc64..7c0b1a2e58 100644
--- a/content/docs/start/experiments.md
+++ b/content/docs/start/experiments.md
@@ -8,7 +8,7 @@ title: 'Get Started: Experiments'
Experiments proliferate quickly in ML projects where there are many
parameters to tune or other permutations of the code. We can organize such
-projects and only keep what we ultimately need with `dvc experiments`. DVC can
+projects and keep only what we ultimately need with `dvc experiments`. DVC can
track experiments for you so there's no need to commit each one to Git. This way
your repo doesn't become polluted with all of them. You can discard experiments
once they're no longer needed.
@@ -18,8 +18,8 @@ once they're no longer needed.
## Running experiments
-In the previous page, we learned how to tune
-[ML pipelines](/doc/start/data-pipelines) and compare the changes. Let's further
+Previously, we learned how to tune [ML pipelines](/doc/start/data-pipelines) and
+[compare the changes](/doc/start/metrics-parameters-plots). Let's further
increase the number of features in the `featurize` stage to see how it compares.
`dvc exp run` makes it easy to change hyperparameters and run a new
@@ -31,12 +31,11 @@ $ dvc exp run --set-param featurize.max_features=3000
-### 💡 Expand to see what this command does.
+### 💡 Expand to see what happens under the hood.
`dvc exp run` is similar to `dvc repro` but with some added conveniences for
running experiments. The `--set-param` (or `-S`) flag sets the values for
-[parameters](/doc/command-reference/params) as a shortcut to editing
-`params.yaml`.
+parameters as a shortcut for editing `params.yaml`.
Check that the `featurize.max_features` value has been updated in `params.yaml`:
@@ -66,10 +65,10 @@ params.yaml featurize.max_features 3000 1500
## Queueing experiments
So far, we have been tuning the `featurize` stage, but there are also parameters
-for the `train` stage, which trains a
-[random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).
+for the `train` stage (which trains a
+[random forest classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)).
-These are the `train` parameters in `params.yaml`:
+These are the `train` parameters from `params.yaml`:
```yaml
train:
@@ -78,9 +77,9 @@ train:
min_split: 2
```
-Let's setup experiments with different hyperparameters. We can define all the
-combinations we want to try without executing anything, by using the `--queue`
-flag:
+Let's set up experiments with different hyperparameters. We can use the `--queue`
+flag to define all the combinations we want to try without executing anything
+(yet):
```dvc
$ dvc exp run --queue -S train.min_split=8
@@ -95,8 +94,7 @@ $ dvc exp run --queue -S train.min_split=64 -S train.n_est=100
Queued experiment '0cdee86' for future execution.
```
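When there are many combinations, writing each command by hand gets tedious. The following Python sketch builds such commands for a small grid. The grid values are illustrative assumptions rather than the exact ones used in this guide, and the helper `queue_commands` is hypothetical. It only prints the commands; running them is left to the shell.

```python
from itertools import product


def queue_commands(grid):
    """Build one `dvc exp run --queue` command per parameter combination."""
    names = sorted(grid)
    commands = []
    for values in product(*(grid[name] for name in names)):
        flags = " ".join(
            f"-S {name}={value}" for name, value in zip(names, values)
        )
        commands.append(f"dvc exp run --queue {flags}")
    return commands


# Illustrative grid (assumed values).
grid = {"train.min_split": [8, 64], "train.n_est": [50, 100]}
for command in queue_commands(grid):
    print(command)
```

Each printed line uses only the `--queue` and `-S` flags shown above, so it can be pasted into a terminal as-is.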
-Next, run all queued experiments using `--run-all` (and in parallel with
-`--jobs`):
+Next, run all (`--run-all`) queued experiments in parallel (using `--jobs`):
```dvc
$ dvc exp run --run-all --jobs 2
@@ -108,7 +106,7 @@ To compare all of these experiments, we need more than `diff`. `dvc exp show`
compares any number of experiments in one table:
```dvc
-$ dvc exp show --no-timestamp
+$ dvc exp show --no-timestamp \
--include-params train.n_est,train.min_split
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃
@@ -146,11 +144,11 @@ Changes for experiment 'exp-98a96' have been applied to your workspace.
-### 💡 Expand to see what this command does.
+### 💡 Expand to see what happens under the hood.
-`dvc exp apply` is similar to `dvc checkout` but it works with experiments. DVC
-tracks everything in the pipeline for each experiment (parameters, metrics,
-dependencies, and outputs) and can later retrieve it as needed.
+`dvc exp apply` is similar to `dvc checkout`, but works with experiments
+instead. DVC tracks everything in the pipeline for each experiment (parameters,
+metrics, dependencies, and outputs), retrieving things later as needed.
Check that `scores.json` reflects the metrics in the table above:
@@ -219,7 +217,7 @@ $ dvc exp pull gitremote exp-bfe64
Pulled experiment 'exp-bfe64' from Git remote 'gitremote'.
```
-> All these commands take a Git remote as an argument. A default DVC remote is
+> All these commands take a Git remote as an argument. A default DVC remote (see
> `dvc remote default`) is also required to share the experiment data.
## Cleaning up
@@ -227,7 +225,7 @@ Pulled experiment 'exp-bfe64' from Git remote 'gitremote'.
Let's take another look at the experiments table:
```dvc
-$ dvc exp show --no-timestamp
+$ dvc exp show --no-timestamp \
--include-params train.n_est,train.min_split
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Experiment ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃
@@ -243,7 +241,7 @@ experiments since the last commit, but don't worry. The experiments remain
experiments from the previous _n_ commits:
```dvc
-$ dvc exp show -n 2 --no-timestamp
+$ dvc exp show -n 2 --no-timestamp \
--include-params train.n_est,train.min_split
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Experiment    ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃
@@ -266,7 +264,7 @@ Eventually, old experiments may clutter the experiments table.
```dvc
$ dvc exp gc --workspace
-$ dvc exp show -n 2 --no-timestamp
+$ dvc exp show -n 2 --no-timestamp \
--include-params train.n_est,train.min_split
┏━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Experiment ┃ avg_prec ┃ roc_auc ┃ train.n_est┃ train.min_split ┃
@@ -277,5 +275,6 @@ $ dvc exp show -n 2 --no-timestamp
└────────────┴──────────┴─────────┴────────────┴─────────────────┘
```
-> `dvc exp gc` only removes references to the experiments, not the cached
-> objects associated to them. To clean up the cache, use `dvc gc`.
+> `dvc exp gc` only removes references to the experiments, not the cached
+> objects associated with them. To clean up the cache, use
+> `dvc gc`.
diff --git a/content/docs/start/index.md b/content/docs/start/index.md
index 34c77a99a8..ecb46dceb8 100644
--- a/content/docs/start/index.md
+++ b/content/docs/start/index.md
@@ -15,7 +15,7 @@ running `dvc init` inside a Git project:
In expandable sections that start with the ⚙️ emoji, we'll be providing more
information for those trying to run the commands. It's up to you to pick the
-best way to read the material - read the text (skip sections like this, and it
+best way to read the material – read the text (skip sections like this, and it
should be enough to understand the idea of DVC), or try to run them and get
first-hand experience.
diff --git a/content/docs/start/metrics-parameters-plots.md b/content/docs/start/metrics-parameters-plots.md
index 8654ae7f75..5bd39f924d 100644
--- a/content/docs/start/metrics-parameters-plots.md
+++ b/content/docs/start/metrics-parameters-plots.md
@@ -5,18 +5,17 @@ title: 'Get Started: Metrics, Parameters, and Plots'
# Get Started: Metrics, Parameters, and Plots
DVC makes it easy to track [metrics](/doc/command-reference/metrics), update
-[parameters](/doc/command-reference/params), and visualize performance with
-[plots](/doc/command-reference/plots). These concepts are introduced below, and
-[Experiments](/doc/start/experiments) shows how to combine them to run and
-compare many iterations of your ML project.
+parameters, and visualize performance with
+[plots](/doc/command-reference/plots). These concepts are introduced below.
-Read on to see how it's done!
+> All of the above can be combined into experiments to run and
+> compare many iterations of your ML project.
## Collecting metrics
First, let's see what the mechanism is for capturing values for these ML
attributes. Let's add a final evaluation stage to our
-[pipeline](/doc/start/data-pipelines):
+[pipeline from before](/doc/start/data-pipelines):
```dvc
$ dvc run -n evaluate \
@@ -33,7 +32,7 @@ $ dvc run -n evaluate \
### 💡 Expand to see what happens under the hood.
The `-M` option here specifies a metrics file, while `--plots-no-cache`
-specifies a plots file produced by this stage that will not be
+specifies a plots file (produced by this stage) which will not be
cached by DVC. `dvc run` generates a new stage in the `dvc.yaml`
file:
@@ -57,8 +56,8 @@ evaluate:
The biggest difference from previous stages in our pipeline is in two new
sections: `metrics` and `plots`. These are used to mark certain files containing
ML "telemetry". Metrics files contain scalar values (e.g. `AUC`) and plots files
-contain matrices and data series (e.g. `ROC curves` or model loss plots) that
-are meant to be visualized and compared.
+contain matrices and data series (e.g. `ROC curves` or model loss plots) meant
+to be visualized and compared.
> With `cache: false`, DVC skips caching the output, as we want `scores.json`,
> `prc.json`, and `roc.json` to be versioned by Git.
@@ -70,15 +69,17 @@ writes the model's
[ROC-AUC](https://scikit-learn.org/stable/modules/model_evaluation.html#receiver-operating-characteristic-roc)
and
[average precision](https://scikit-learn.org/stable/modules/model_evaluation.html#precision-recall-and-f-measures)
-to `scores.json`, which is marked as a metrics file with `-M`:
+to `scores.json`, which in turn is marked as a `metrics` file with `-M`. Its
+contents are:
```json
{ "avg_prec": 0.5204838673030754, "roc_auc": 0.9032012604172255 }
```
-It also writes `precision`, `recall`, and `thresholds` arrays (obtained using
+`evaluate.py` also writes `precision`, `recall`, and `thresholds` arrays
+(obtained using
[`precision_recall_curve`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html))
-into plots file `prc.json`:
+into the plots file `prc.json`:
```json
{
@@ -94,9 +95,9 @@ Similarly, it writes arrays for the
[roc_curve](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html)
into `roc.json` for an additional plot.
-> DVC doesn't force you to use any specific file names, or even format or
-> structure of a metrics or plots file - it's pretty much user and case defined.
-> Please refer to `dvc metrics` and `dvc plots` for more details.
+> DVC doesn't force you to use any specific file names, nor does it enforce a
+> format or structure for metrics or plots files. It's completely
+> user/case-defined. Refer to `dvc metrics` and `dvc plots` for more details.
You can view tracked metrics and plots with DVC. Let's start with the metrics:
@@ -133,14 +134,15 @@ $ git add scores.json prc.json roc.json
$ git commit -a -m "Create evaluation stage"
```
-Later we will see how these and other can be used to compare and visualize
-different pipeline iterations. For now, let's see how can we capture another
-important piece of information that will be useful for comparison: parameters.
+Later we will see how to
+[compare and visualize different pipeline iterations](#comparing-iterations).
+For now, let's see how we can capture another important piece of information
+which will be useful for comparison: parameters.
## Defining stage parameters
-It's pretty common for data science pipelines to include configuration files
-that define adjustable parameters to train a model, do pre-processing, etc. DVC
+It's pretty common for data science pipelines to include configuration files to
+define adjustable parameters to train a model, do pre-processing, etc. DVC
provides a mechanism for stages to depend on the values of specific sections of
such a config file (YAML, JSON, TOML, and Python formats are supported).
@@ -162,7 +164,7 @@ featurize:
-### 💡 Expand to recall how it was generated.
+### ⚙️ Expand to recall how it was generated.
The `featurize` stage
[was created](/doc/start/data-pipelines#dependency-graphs-dags) with this
@@ -179,13 +181,12 @@ $ dvc run -n featurize \
-The `params` section defines the [parameter](/doc/command-reference/params)
-dependencies of the `featurize` stage. By default DVC reads those values
-(`featurize.max_features` and `featurize.ngrams`) from a `params.yaml` file. But
-as with metrics and plots, parameter file names and structure can also be user
-and case defined.
+The `params` section defines the parameter dependencies of the `featurize`
+stage. By default, DVC reads those values (`featurize.max_features` and
+`featurize.ngrams`) from a `params.yaml` file. But as with metrics and plots,
+parameter file names and structure can also be user- and case-defined.
-This is how our `params.yaml` file looks like:
+Here are the contents of our `params.yaml` file:
```yaml
prepare:
@@ -215,25 +216,25 @@ We are definitely not happy with the AUC value we got so far! Let's edit the
+ ngrams: 2
```
-And the beauty of `dvc.yaml` is that all you need to do now is to run:
+The beauty of `dvc.yaml` is that all you need to do now is run:
```dvc
$ dvc repro
```
-It'll analyze the changes, use existing cache of previous runs, and execute only
-the commands that are needed to get the new results (model, metrics, plots).
+It'll analyze the changes, use existing results from the run-cache,
+and execute only the commands needed to produce new results (model, metrics,
+plots).
The same logic applies to other possible adjustments – edit source code, update
-datasets β you do the changes, use `dvc repro`, and DVC runs what needs to be
-run.
+datasets – you make the changes, run `dvc repro`, and DVC runs what's needed.
## Comparing iterations
Finally, let's see how the updates improved performance. DVC has a few commands
-to see metrics and parameter changes, and to visualize plots, for one or more
-pipeline iterations. Let's compare the current "bigrams" run with the last
-committed "baseline" iteration:
+to show changes in metrics and parameters, and to visualize plots. These
+commands work on a single pipeline iteration or across several. Let's compare
+the current "bigrams" run with the last committed "baseline" iteration:
```dvc
$ dvc params diff
@@ -271,5 +272,5 @@ file:///Users/dvc/example-get-started/plots.html
> [Git revisions](https://git-scm.com/docs/gitrevisions) (commits, tags, branch
> names) to compare.
-In the next page, learn advanced ways to track, organize, and compare more
-experiment iterations.
+On the next page, you can learn advanced ways to track, organize, and compare
+more experiment iterations.
diff --git a/content/docs/user-guide/basic-concepts/dependency.md b/content/docs/user-guide/basic-concepts/dependency.md
index 256e77f183..453fda173f 100644
--- a/content/docs/user-guide/basic-concepts/dependency.md
+++ b/content/docs/user-guide/basic-concepts/dependency.md
@@ -1,6 +1,6 @@
---
name: Dependency
-match: [dependency, dependencies, depends]
+match: [dependency, dependencies, depends, input]
tooltip: >-
A file or directory (possibly tracked by DVC) recorded in the `deps` section
of a stage (in `dvc.yaml`) or `.dvc` file. See `dvc run`. Stages are
diff --git a/content/docs/user-guide/basic-concepts/dvc-cache.md b/content/docs/user-guide/basic-concepts/dvc-cache.md
index 383b1411d2..c809b4c9fc 100644
--- a/content/docs/user-guide/basic-concepts/dvc-cache.md
+++ b/content/docs/user-guide/basic-concepts/dvc-cache.md
@@ -3,7 +3,7 @@ name: 'DVC Cache'
match: ['DVC cache', cache, caches, cached, 'cache directory']
tooltip: >-
The DVC cache is a hidden storage (by default in `.dvc/cache`) for files and
- directories tracked by DVC, and their different versions. Learn more about
- it's structure
+ directories tracked by DVC, and their different versions. Learn more about its
+ structure
[here](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory).
---
diff --git a/content/docs/user-guide/basic-concepts/experiment.md b/content/docs/user-guide/basic-concepts/experiment.md
index e4130a9683..dc66a0c360 100644
--- a/content/docs/user-guide/basic-concepts/experiment.md
+++ b/content/docs/user-guide/basic-concepts/experiment.md
@@ -4,8 +4,8 @@ match: [experiment, experiments]
tooltip: >-
An attempt to reach desired/better/interesting results during data pipelining
or ML model development. DVC is designed to help [manage
- experiments](/doc/user-guide/experiment-management), having built-in
- mechanisms like the
+ experiments](/doc/start/experiments), having [built-in
+ mechanisms](/doc/user-guide/experiment-management) like the
[run-cache](/doc/user-guide/project-structure/internal-files#run-cache) and
the `dvc experiments` commands (coming in DVC 2.0).
---
diff --git a/content/docs/user-guide/basic-concepts/file-link.md b/content/docs/user-guide/basic-concepts/file-link.md
new file mode 100644
index 0000000000..17f48ea150
--- /dev/null
+++ b/content/docs/user-guide/basic-concepts/file-link.md
@@ -0,0 +1,9 @@
+---
+name: File Linking
+match: [linked]
+tooltip: >-
+ A way to have a file appear in multiple different folders without occupying
+ more physical space on the storage disk. This is both fast and economical. See
+ [large dataset optimization](/doc/user-guide/large-dataset-optimization) and
+ `dvc config cache` for more on file linking.
+---
diff --git a/content/docs/user-guide/basic-concepts/parameter.md b/content/docs/user-guide/basic-concepts/parameter.md
index a48c4793e0..bd1da45ce5 100644
--- a/content/docs/user-guide/basic-concepts/parameter.md
+++ b/content/docs/user-guide/basic-concepts/parameter.md
@@ -5,5 +5,5 @@ tooltip: >-
Pipeline stages (defined in `dvc.yaml`) can depend on specific values inside
an arbitrary YAML, JSON, TOML, or Python file (`params.yaml` by default).
Stages are invalid (considered outdated) when any of their parameter values
- change. See `dvc param`.
+ change. See [`dvc params`](/doc/command-reference/params).
---
diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md
new file mode 100644
index 0000000000..f66383479f
--- /dev/null
+++ b/content/docs/user-guide/basic-concepts/pipeline.md
@@ -0,0 +1,7 @@
+---
+name: Pipeline (DAG)
+match: [DAG, pipeline, 'data pipeline', 'data pipelines']
+tooltip: >-
+ A set of inter-dependent stages. This is also called a [dependency
+ graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph).
+---
diff --git a/content/docs/user-guide/contributing/docs.md b/content/docs/user-guide/contributing/docs.md
index fee50e0fbf..9fcf9afe3c 100644
--- a/content/docs/user-guide/contributing/docs.md
+++ b/content/docs/user-guide/contributing/docs.md
@@ -16,7 +16,7 @@ To contribute documentation, these are the relevant locations:
- [Content](https://github.com/iterative/dvc.org/tree/master/content/docs)
(`content/docs/`):
[Markdown](https://guides.github.com/features/mastering-markdown/) files. One
- file - one page of the documentation.
+ file – one page of the documentation.
- [Images](https://github.com/iterative/dvc.org/tree/master/static/img)
(`static/img/`): Add new images (`.png`, `.svg`, etc.) here. Use them in
Markdown files like this: ``.