diff --git a/content/docs/command-reference/exp/init.md b/content/docs/command-reference/exp/init.md index 0cba098b25..02184b6ce3 100644 --- a/content/docs/command-reference/exp/init.md +++ b/content/docs/command-reference/exp/init.md @@ -3,6 +3,9 @@ Codify project using [DVC metafiles](/doc/user-guide/project-structure) to run [experiments](/doc/user-guide/experiment-management). +> Requires a DVC repository, created with `git init` and +> `dvc init`. + ## Synopsis ```usage @@ -32,6 +35,11 @@ training of machine learning models. This command is intended to be a quick way to start running experiments. To create more complex stages and pipelines, use `dvc stage add`. +> 📖 More context in [Experiments Overview]. + +[experiments overview]: + /doc/user-guide/experiment-management/experiments-overview + ### The `command` argument The `command` argument is optional, if you are using `--interactive` mode. The diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 4bafa68591..899b93f3ae 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -148,6 +148,7 @@ "slug": "experiment-management", "source": "experiment-management/index.md", "children": [ + "experiments-overview", "running-experiments", "comparing-experiments", "sharing-experiments", diff --git a/content/docs/user-guide/experiment-management/cleaning-experiments.md b/content/docs/user-guide/experiment-management/cleaning-experiments.md index bd48c09f97..2415553449 100644 --- a/content/docs/user-guide/experiment-management/cleaning-experiments.md +++ b/content/docs/user-guide/experiment-management/cleaning-experiments.md @@ -2,7 +2,9 @@ Although DVC uses minimal resources to keep track of the experiments, they may clutter tables and the workspace. DVC allows to remove specific experiments from -the workspace or delete all not-yet-persisted experiments at once. +the workspace or delete all not-yet-[persisted] experiments at once. 
+ +[persisted]: /doc/user-guide/experiment-management/persisting-experiments ## Removing specific experiments diff --git a/content/docs/user-guide/experiment-management/experiments-overview.md b/content/docs/user-guide/experiment-management/experiments-overview.md new file mode 100644 index 0000000000..82251c59e0 --- /dev/null +++ b/content/docs/user-guide/experiment-management/experiments-overview.md @@ -0,0 +1,72 @@ +# DVC Experiments Overview + +DVC Experiments are captured automatically by DVC when [run]. Each experiment +creates and tracks a variation of your data science project based on the changes +in your workspace. + +Experiments preserve a connection to the latest commit in the current branch +(Git `HEAD`) as their parent or _baseline_, but do not form part of the regular +Git tree (unless you make them [persistent]). This prevents bloating your repo +with temporary commits and branches. + +[run]: /doc/user-guide/experiment-management/running-experiments + +
+ +### ⚙️ How does DVC track experiments? + +Experiments are custom [Git references](/blog/experiment-refs) (found in +`.git/refs/exps`) with one or more commits based on `HEAD`. These commits are +hidden and not checked out by DVC. Note that these are not pushed to Git remotes +by default either (see `dvc exp push`). + +Note that DVC Experiments require a unique name to identify them. DVC will +usually auto-generate one by default, such as `exp-bfe64` (based on the +experiment's hash). A custom name can be set instead, using the `--name`/`-n` +option of `dvc exp run`. These names can be used to reference experiments in +other `dvc exp` subcommands. + +
+ +## Basic workflow + +`dvc exp` commands let you automatically track a variation of a project version +(the baseline). You can create independent groups of experiments this way, as +well as review, compare, and restore them later. The basic workflow goes like +this: + +- Modify hyperparameters or other dependencies (input data, source code, + commands to execute, etc.). Leave these changes uncommitted in Git. +- [Run experiments][run] with `dvc exp run` (instead of `repro`). The results + are reflected in your workspace, and tracked automatically. +- Review and [compare] experiments with `dvc exp show` or `dvc exp diff`, using + [metrics](/doc/command-reference/metrics) to identify the best one(s). Repeat + 🔄 +- Make certain experiments [persistent] by committing their results to Git. This + lets you repeat the process from that point. + +[compare]: /doc/user-guide/experiment-management/comparing-experiments +[persistent]: /doc/user-guide/experiment-management/persisting-experiments + +## Initialize DVC Experiments on any project + +To use DVC Experiments you need a DVC project with a minimal +structure and configuration. To avoid having to bootstrap DVC manually, the +`dvc exp init` command lets you quickly onboard an existing project to the DVC +Experiments workflow. + +It will create a simple `dvc.yaml` metafile, which codifies your planned +experiments. This includes the locations for expected dependencies +(data, parameters, source code) and outputs (ML models, +metrics, etc.). These assume [sane defaults] but can be customized +with the options of `dvc exp init`. + +💡 We recommend adding the `-i` flag to use its `--interactive` mode. This will +ask you how to run the experiments, and guide you through customizing the +aforementioned locations (optional). + +You can review the resulting changes to your repo (and commit them to Git) to +begin using DVC Experiments. Now you can move on to [running experiments][run] +(next). 
+ +[sane defaults]: /doc/command-reference/exp/init#description diff --git a/content/docs/user-guide/experiment-management/index.md b/content/docs/user-guide/experiment-management/index.md index ca008dad2f..b501fd0a16 100644 --- a/content/docs/user-guide/experiment-management/index.md +++ b/content/docs/user-guide/experiment-management/index.md @@ -1,113 +1,90 @@ # Experiment Management -_New in DVC 2.0 (see `dvc version`)_ - -Data science and ML are iterative processes that require a large number of -attempts to reach a certain level of a metric. Experimentation is part of the -development of data features, hyperspace exploration, deep learning -optimization, etc. DVC helps you codify and manage all of your -experiments, supporting these main approaches: - -1. Create [experiments](#experiments) that derive from your latest project - version without having to track them manually. DVC does that automatically, - letting you list and compare them. The best ones can be made persistent, and - the rest archived. -2. Place in-code [checkpoints](#checkpoints-in-source-code) that mark a series - of variations, forming a deep experiment. DVC helps you capture them at - runtime, and manage them in batches. -3. Make experiments or checkpoints [persistent](#persistent-experiments) by - committing them to your repository. Or create these versions - from scratch like typical project changes. - - At this point you may also want to consider the different - [ways to organize](#organization-patterns) experiments in your project (as - Git branches, as folders, etc.). - -DVC also provides specialized features to codify and analyze experiments. -[Parameters](/doc/command-reference/params) are simple values you can tweak in a -human-readable text file, which cause different behaviors in your code and -models. 
On the other end, [metrics](/doc/command-reference/metrics) (and +Data science and machine learning are iterative processes that require a large +number of attempts to reach a certain level of a metric. Experimentation is part +of the development of data features, hyperspace exploration, deep learning +optimization, etc. + +Some of DVC's base features already help you codify and analyze experiments. +[Parameters](/doc/command-reference/params) are simple values in a formatted +text file which you can tweak and use in your code. On the other end, +[metrics](/doc/command-reference/metrics) (and [plots](/doc/command-reference/plots)) let you define, visualize, and compare -meaningful measures for the experimental results. - -> 👨‍💻 See [Get Started: Experiments](/doc/start/experiments) for a hands-on -> introduction to DVC experiments. +quantitative measures of your results. -## Experiments +## Experimentation in DVC -`dvc exp` commands let you automatically track a variation to an established -[data pipeline](/doc/command-reference/dag). You can create multiple isolated -experiments this way, as well as review, compare, and restore them later, or -roll back to the baseline. The basic workflow goes like this: - -- Modify stage parameters or other dependencies (e.g. input data, - source code) of committed stages. -- Use `dvc exp run` (instead of `repro`) to execute the pipeline. The results - are reflected in your workspace, and tracked automatically. -- Use [metrics](/doc/command-reference/metrics) to identify the best - experiment(s). -- Visualize, compare experiments with `dvc exp show` or `dvc exp diff`. Repeat - 🔄 -- Use `dvc exp apply` to roll back to the best one. -- Make the selected experiment persistent by committing its results to Git. This - cleans the slate so you can repeat the process. 
- -## Checkpoints in source code +_New in DVC 2.0 (see `dvc version`)_ -To track successive steps in a longer experiment, you can register checkpoints -from your code at runtime. This allows you, for example, to track the progress -in deep learning techniques such as evolving neural networks. +DVC experiment management features build on top of base DVC features to form a +comprehensive framework to organize, execute, manage, and share ML experiments. +They support these main approaches: -This kind of experiments track a series of variations (the checkpoints) and its -execution can be stopped and resumed as needed. You interact with them using -`dvc exp run` and its `--rev`, `--reset` options (see also the `checkpoint` -field in `dvc.yaml` `outs`). +- Compare parameters and metrics of existing project versions (for example + different Git branches) against each other or against new, uncommitted results + in your workspace. One tool to do so is `dvc exp diff`. -> 📖 To learn more, see the dedicated -> [Checkpoints](/doc/user-guide/experiment-management/checkpoints) guide. +- [Run and capture] multiple experiments (derived from any project version as + baseline) without polluting your Git history. DVC tracks them for you, letting + you compare and share them. 📖 More info in the [Experiments + Overview][experiments]. -## Persistent experiments +- Generate [checkpoints] at runtime to keep track of the internal progress of + deeper experiments. DVC captures [live metrics](/doc/dvclive), which you can + manage in batches. -When your experiments are good enough to save or share, you may want to store -them persistently as Git commits in your repository. 
+[run and capture]: /doc/user-guide/experiment-management/running-experiments +[experiments]: /doc/user-guide/experiment-management/experiments-overview +[checkpoints]: /doc/user-guide/experiment-management/checkpoints -Whether the results were produced with `dvc repro` directly, or after a -`dvc exp` workflow (refer to previous sections), the `dvc.yaml` and `dvc.lock` -pair in the workspace will codify the experiment as a new project -version. The right outputs (including -[metrics](/doc/command-reference/metrics)) should also be present, or available -via `dvc checkout`. +> 👨‍💻 See [Get Started: Experiments](/doc/start/experiments) for a hands-on +> introduction to DVC experiments. ### Organization patterns -DVC takes care of arranging `dvc exp` experiments and the data -cache under the hood. But when it comes to full-blown persistent -experiments, it's up to you to decide how to organize them in your project. -These are the main alternatives: +It's up to you to decide how to organize completed experiments. These are the +main alternatives: - **Git tags and branches** - use the repo's "time dimension" to distribute your experiments. This makes the most sense for experiments that build on each other. Helpful if the Git [revisions](https://git-scm.com/docs/revisions) can be easily visualized, for example with tools [like GitHub](https://docs.github.com/en/github/visualizing-repository-data-with-graphs/viewing-a-repositorys-network). + - **Directories** - the project's "space dimension" can be structured with directories (folders) to organize experiments. Useful when you want to see all your experiments at the same time (without switching versions) by just exploring the file system. + - **Hybrid** - combining an intuitive directory structure with a good repo branching strategy tends to be the best option for complex projects. - Completely independent experiments live in separate directories, while their - progress can be found in different branches. 
+ Completely independent experiments live in separate directories (and can be + generated with [`foreach` stages], for example), while their progress can be + found in different branches. + +- **Labels** - in general, you can record experiments in a separate system and + structure them using custom labeling. This is typical in dedicated experiment + tracking tools. A possible problem with this approach is that it's easy to + lose the connection between your project history and the experiments logged. + +DVC takes care of arranging `dvc exp` experiments and the data +cache under the hood so there's no need to decide on the above +until your experiments are made [persistent]. + +[`foreach` stages]: + /doc/user-guide/project-structure/pipelines-files#foreach-stages +[persistent]: /doc/user-guide/experiment-management/persisting-experiments -## Automatic log of stage runs (run-cache) +## Run Cache: Automatic Log of Stage Runs -Every time you `dvc repro` pipelines or `dvc exp run` experiments, DVC logs the -unique signature of each stage run (to `.dvc/cache/runs` by default). If it -never happened before, the stage command(s) are executed normally. Every +Every time you [reproduce](/doc/command-reference/repro) a pipeline with DVC, it +logs the unique signature of each stage run (in `.dvc/cache/runs` by default). +If it never happened before, the stage command(s) are executed normally. Every subsequent time a [stage](/doc/command-reference/run) runs under the same conditions, the previous results can be restored instantly, without wasting time or computing resources. ✅ This built-in feature is called run-cache and it can -dramatically improve performance. It's enabled out-of-the-box (but can be -disabled with the `--no-run-cache` command option). +dramatically improve performance. It's enabled out-of-the-box (can be disabled), +which means DVC is already saving all of your tests and experiments behind the +scenes. But there's no easy way to explore it. 
diff --git a/content/docs/user-guide/experiment-management/persisting-experiments.md b/content/docs/user-guide/experiment-management/persisting-experiments.md index 669a340808..3314e36d29 100644 --- a/content/docs/user-guide/experiment-management/persisting-experiments.md +++ b/content/docs/user-guide/experiment-management/persisting-experiments.md @@ -1,11 +1,9 @@ # Persisting Experiments -DVC runs experiments outside of the Git stage/commit cycle for quick iteration. -When your experiments are good enough to save or share, you may want to store -them persistently as Git commits in your repository. - -In this section, we describe how to bring them to the standard Git workflow with -`dvc exp branch` and `dvc exp apply`. +DVC Experiments run outside of the regular Git workflow for faster iteration and +to avoid polluting your repository's history. Once experiments are +good enough to keep or distribute, you may want to store them persistently as +Git commits. ## Create a Git branch from an experiment @@ -73,7 +71,7 @@ $ dvc exp show --include-params=my_param The results found in the workspace are shown in the respective row. When you want to bring another experiment to the workspace, you can reference it using -it's name or ID, e.g.: +its name, e.g.: ```dvc $ dvc exp apply exp-e6c97 diff --git a/content/docs/user-guide/experiment-management/running-experiments.md b/content/docs/user-guide/experiment-management/running-experiments.md index 0518bcc98e..0edca46ab0 100644 --- a/content/docs/user-guide/experiment-management/running-experiments.md +++ b/content/docs/user-guide/experiment-management/running-experiments.md @@ -1,8 +1,8 @@ # Running Experiments -We explain how DVC codifies and executes experiments, setting their parameters, -using multiple jobs to run them in parallel, and running them in queues, among -other details. 
+We explain how to execute DVC Experiments, setting their parameters, using +multiple jobs to run them in parallel, and running them in queues, among other +details. > 📖 If this is the first time you are introduced into data science > experimentation, you may want to check the basics in @@ -231,7 +231,7 @@ Note that Git-ignored files/dirs are explicitly excluded from queued/temp runs to avoid committing unwanted files into Git (e.g. once successful experiments are [persisted]). -[persisted]: /doc/user-guide/experiment-management#persistent-experiments +[persisted]: /doc/user-guide/experiment-management/persisting-experiments > 💡 To include untracked files, stage them with `git add` first (before > `dvc exp run`) and `git reset` them afterwards.