227 changes: 115 additions & 112 deletions public/static/docs/command-reference/diff.md
@@ -1,41 +1,53 @@
# diff

Show changes between commits in the <abbr>DVC repository</abbr>, or between a
commit and the <abbr>workspace</abbr>. The comparison can be narrowed down to
specific target files/directories tracked by DVC.
Show added, modified, or deleted DVC-tracked files and directories between
commits in the <abbr>DVC repository</abbr>, or between a commit and the
workspace.

## Synopsis

```usage
usage: dvc diff [-h] [-q | -v] [-t TARGET] a_ref [b_ref]
usage: dvc diff [-h] [-q | -v]
[--show-json] [--show-hash]
[a_rev] [b_rev]

positional arguments:
a_rev Old Git commit to compare (defaults to HEAD)
b_rev New Git commit to compare (defaults to the
current workspace)
a_rev Old Git commit to compare (defaults to HEAD)
b_rev New Git commit to compare (defaults to the current workspace)
```

## Description

Given two commit hashes, branch or tag names, etc.
([references](https://git-scm.com/docs/revisions)) `a_ref` and `b_ref`, this
command shows a comparative summary of basic statistics related to files tracked
by DVC: how many files were deleted/changed, and the file size differences.
Prints a list of files and directories added, modified, or deleted in a Git
commit `b_rev` as compared to another Git commit `a_rev`. Both `a_rev` and
`b_rev` accept any [Git revision](https://git-scm.com/docs/gitrevisions): a
branch or tag name, a Git commit hash, etc.
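
To make the three categories concrete, here is a minimal sketch of how two
snapshots could be classified. It assumes each revision can be represented as a
mapping from path to hash value; the function name, the paths, and the hashes
are hypothetical and chosen only for illustration. This is not DVC's actual
implementation.

```python
def classify_changes(old, new):
    """Compare two {path: hash} snapshots and group paths by change type."""
    # Paths present only in the newer snapshot.
    added = sorted(path for path in new if path not in old)
    # Paths present only in the older snapshot.
    deleted = sorted(path for path in old if path not in new)
    # Paths present in both snapshots whose hash values differ.
    modified = sorted(
        path for path in old if path in new and old[path] != new[path]
    )
    return {"added": added, "deleted": deleted, "modified": modified}


# Hypothetical snapshots; the hash values are made up for illustration.
a_rev = {"data/data.xml": "a304afb", "model.pkl": "4363"}
b_rev = {"data/data.xml": "a304afb", "model.pkl": "662e", "auc.metric": "0f1e"}

print(classify_changes(a_rev, b_rev))
```

Unchanged paths (here `data/data.xml`) fall into none of the three groups, which
matches the idea that only differences between `a_rev` and `b_rev` are listed.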

> Note that `dvc diff` does not show the line-to-line comparisons like
> `git diff` or [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This
> is because the data tracked by DVC comes in many formats such as
> structured text, binary blobs, etc. For an example on how to create
> line-to-line text file comparison, refer to
> [issue #770](https://github.com/iterative/dvc/issues/770#issuecomment-512693256).
It defaults to comparing the current workspace and the last commit (`HEAD`), if
arguments `a_rev` and `b_rev` are not specified.

Options `--show-json` and `--show-hash` can be used to modify the format and
level of detail of the output. See the [Options](#options) and
[Examples](#examples) sections below for more details.

`dvc diff` does not have an effect when the repository is not tracked by Git,
for example when `dvc init` was used with the `--no-scm` option.

> `dvc diff` does not have an effect... when `dvc init` was used with the `--no-scm` option

Somewhat unrelated, but I checked and this isn't correct. You can create a Git repo, run `dvc init --no-scm` in it, `dvc add` a data file, commit (A); change the data file and `dvc add` it again, commit (B), and `dvc diff HEAD^` will work.

Is this buggy behavior? If not, we just need to rewrite the paragraph above to something more correct.

It feels more or less correct to me. The main point here is "when the repository is not tracked by Git". In your example it is tracked by Git, right? Unless I'm missing something.

Yes it's tracked by Git, but I initialized the DVC project with dvc init --no-scm. The paragraph above reads ...`dvc diff` does not have an effect... for example when `dvc init` was used with the `--no-scm` option but it does have an effect, as long as there is a Git repo underneath. Maybe just remove the part about --no-scm in the paragraph?

It's fine, I still don't see a problem. It's intuitively clear what it means, even if the implementation is not 100% correct and allows a mix of Git and `--no-scm`.

Hmmm OK but just removing that last incorrect statement wouldn't hurt either.


> Note that current `dvc diff` implementation does not show the line-to-line
> comparison among the files in each revision, like `git diff` or
> [GNU `diff`](https://www.gnu.org/software/diffutils/) can. This is because the
> data tracked by DVC can come in many possible formats, e.g. structured
> text or binary blobs. For an example on how to create line-to-line text
> file comparison, refer to this
> [comment](https://github.com/iterative/dvc/issues/770#issuecomment-512693256).

## Options

- `-t TARGET`, `--target TARGET` - path to a data file or directory to limit
diff for.
- `--show-json` - generate output in JSON format. Usually needed to integrate
DVC into scripts.

- `--show-hash` - print file and directory hash values along with their path.
Useful for debug purposes.

- `-h`, `--help` - prints the usage/help message, and exits.

@@ -46,148 +58,139 @@ for example when `dvc init` was used with the `--no-scm` option.

## Examples

For these examples we can use the chapters in our
[Get Started](/doc/get-started) section, up to
[Add Files](/doc/get-started/add-files).
For these examples we can use the [Get Started](/doc/get-started) project.

<details>

### Click and expand to setup example
### Click and expand to setup the project to run examples

Start by cloning our example repo if you don't already have it. Then move into
the repo and checkout the
[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
tag, corresponding to the [Add Files](/doc/get-started/add-files) _Get Started_
chapter:
Start by cloning our example repo if you don't already have it:

```dvc
$ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
$ git checkout 3-add-file
```

Download the precomputed data using:
Download data using:

```dvc
$ dvc pull
$ dvc fetch -T
Preparing to download data from 'https://remote.dvc.org/get-started'
...
```

</details>
The `-T` flag passed to `dvc fetch` makes sure we have all the data files
related to all existing tags in the repo. You may see the available tags of our
example repo [here](https://github.com/iterative/example-get-started/tags).

## Example: Previous commit in the same branch
</details>

The minimal `dvc diff`, run without arguments, defaults to comparing DVC-tacked
files between `HEAD` (current Git commit) and the current <abbr>workspace</abbr>
(uncommitted changes, if any).
## Example: Checking workspace changes

To see the difference between the very previous commit of the project and the
workspace, we can use `HEAD^` as `a_ref`:
The minimal `dvc diff`, run without arguments, defaults to comparing DVC-tracked
files between `HEAD` (last Git commit) and the current <abbr>workspace</abbr>
(uncommitted changes, if any):

```dvc
$ dvc diff HEAD^
dvc diff from df613bc to ed10968

diff for 'data/data.xml'
+data/data.xml with md5 a304afb96060aad90176268345e10355

added file with size 37.9 MB
$ dvc diff
```

## Example: Specific targets across Git commits

We can base this example in the [Metrics](/doc/get-started/metrics) and
[Compare Experiments](/doc/get-started/compare-experiments) chapters of our _Get
Started_ section, that describe different experiments to produce the `model.pkl`
file. Our example repository has the `bigrams-experiment` and
`baseline-experiment`
[tags](https://github.com/iterative/example-get-started/tags) respectively to
reference these experiments.
## Example: Comparing workspace with arbitrary commits

<details>

### Click and expand to setup example
### Click and expand to setup the example

Having followed the previous example's setup, move into the
`example-get-started/` directory. Then make sure that you have the latest code
and data with the following commands.
Let's checkout the
[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
tag, corresponding to the [Add Files](/doc/get-started/add-files) _Get Started_
chapter, right after we added `data.xml` file with DVC:

```dvc
$ git checkout master
$ dvc fetch -T
$ git checkout 3-add-file
$ dvc pull
```

The `-T` flag passed to `dvc fetch` makes sure we have all the data files
related to all existing tags in the repo. You take a look at the
[available tags](https://github.com/iterative/example-get-started/tags) of our
example repo.

</details>

To see the difference in `model.pkl` among these tags, we can run the following
command.
To see the difference between the previous commit of the project and the
workspace, we can use `HEAD^` as `a_rev`:

```dvc
$ dvc diff -t model.pkl baseline-experiment bigrams-experiment
dvc diff from bc1722d to 8c1169d
$ dvc diff HEAD^
Added:
data/data.xml

diff for 'model.pkl'
-model.pkl with md5 a664896
+model.pkl with md5 3863d0e
...
files summary: 1 added, 0 deleted, 0 modified
```

The output from this command confirms that there's a difference in the
`model.pkl` file between the 2 Git commits (tags `baseline-experiment` and
`bigrams-experiment`) we indicated.
## Example: Comparing tags or branches

### What about directories?
<details>

Unlike Git, DVC features controlling entire directories without having to add
each individual file. See `dvc add` without `--recursive` for example. `dvc run`
can track entire directories (when these are specified as command dependencies
or <abbr>outputs</abbr>).
### Click and expand to setup the example

We can use `dvc diff` to check for changes in a directory by specifying the
directory as the target (with option `-t`). Note that we skip the `b_ref`
argument this time, that defaults to `HEAD`.
Our example repository has the `baseline-experiment` and `bigrams-experiment`
[tags](https://github.com/iterative/example-get-started/tags), which reference
two different modeling experiments.

Having followed the example's setup, move into the `example-get-started/`
directory. Then make sure that you have the latest code and data with the
following commands:

```dvc
$ dvc diff -t data/features baseline-experiment
dvc diff from bc1722d to 8c1169d
$ git checkout master
$ dvc checkout
```

diff for 'data/features'
-data/features with md5 3338d2c.dir
+data/features with md5 42c7025.dir
</details>

0 files not changed, 0 files modified, 0 files added,
0 files deleted, size was increased by 2.9 MB
```dvc
$ dvc diff baseline-experiment bigrams-experiment
Modified:
auc.metric
data/features/
data/features/test.pkl
data/features/train.pkl
model.pkl

files summary: 0 added, 0 deleted, 4 modified
```

## Example: Confirming that a target has not changed
The output from this command confirms that there's a difference in 4 files
between the tags `baseline-experiment` and `bigrams-experiment`.

Let's use our example repo once again, that has several
[available tags](https://github.com/iterative/example-get-started/tags) for
conveniency. The `5-preparation` tag corresponds to the
[Connect Code and Data](/doc/get-started/connect-code-and-data) chapter of our
_Get Started_ section, where the `dvc run` command is used to create a
`prepare.dvc` stage file. This DVC-file tracks the `data/prepared` directory
<abbr>output</abbr>.
## Example: Using different output formats

```dvc
$ dvc diff -t data/prepared 5-preparation
dvc diff from 3deeec1 to 8c1169d

diff for 'data/prepared'
-data/prepared with md5 6836f79.dir
+data/prepared with md5 6836f79.dir
Let's use the same command as above, but with JSON output and including hash
values:

2 files not changed, 0 files modified, 0 files added,
0 files deleted, size was not changed
```dvc
$ dvc diff --show-json --show-hash \
baseline-experiment bigrams-experiment
```

The command above checks whether there have been any changes to the
`data/prepared` directory after the `5-preparation` tag (since the `b_ref` is
`HEAD` by default). The output tells us that there have been no changes to that
directory (or to any other file).
It outputs:

```json
{
"added": [],
"deleted": [],
"modified": [
...{
"path": "data/features/",
"hash": {
"old": "3338d2c21bdb521cda0ba4add89e1cb0.dir",
"new": "42c7025fc0edeb174069280d17add2d4.dir"
}
},
...{
"path": "model.pkl",
"hash": {
"old": "43630cce66a2432dcecddc9dd006d0a7",
"new": "662eb7f64216d9c2c1088d0a5e2c6951"
}
}
]
}
```
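
Since `--show-json` is meant for integrating DVC into scripts, the following
sketch shows one way such output could be consumed. It assumes the JSON
structure shown above (`added`, `deleted`, and `modified` lists, with `path` and
`hash` fields on each modified entry); the `summarize` helper is hypothetical,
and the sample string is copied from the output above with the elided entries
(`...`) left out. In practice the string would come from running the command
itself.

```python
import json

# Sample output copied from the example above; entries elided there ("...")
# are simply omitted here rather than guessed.
SAMPLE_OUTPUT = """
{
  "added": [],
  "deleted": [],
  "modified": [
    {
      "path": "data/features/",
      "hash": {
        "old": "3338d2c21bdb521cda0ba4add89e1cb0.dir",
        "new": "42c7025fc0edeb174069280d17add2d4.dir"
      }
    },
    {
      "path": "model.pkl",
      "hash": {
        "old": "43630cce66a2432dcecddc9dd006d0a7",
        "new": "662eb7f64216d9c2c1088d0a5e2c6951"
      }
    }
  ]
}
"""


def summarize(diff_json):
    """Return per-category counts and the list of modified paths."""
    diff = json.loads(diff_json)
    # Count entries in each category, treating a missing key as empty.
    counts = {
        key: len(diff.get(key, [])) for key in ("added", "deleted", "modified")
    }
    modified_paths = [entry["path"] for entry in diff.get("modified", [])]
    return counts, modified_paths


counts, modified_paths = summarize(SAMPLE_OUTPUT)
print(counts)
print(modified_paths)
```

The counts here reflect only the two entries kept in the sample, not the full
output of the command.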