Merged
2 changes: 1 addition & 1 deletion static/docs/commands-reference/fetch.md
@@ -100,7 +100,7 @@ specified in DVC-files currently in the workspace are considered by `dvc fetch`
of a DVC-file ([experiments](/doc/get-started/experiments)), not just the
current one.

- `-T`, `--all-tags` - fetch cache for all tags. Similar to `-a` above
- `-T`, `--all-tags` - fetch cache for all tags. Similar to `-a` above.

- `--show-checksums` - show checksums instead of file names when printing the
download progress.
8 changes: 4 additions & 4 deletions static/docs/commands-reference/import-url.md
@@ -23,10 +23,10 @@ In some cases it's convenient to add a data file or directory from a remote
location into the workspace, such that it will be automatically updated (by
`dvc repro`) when the external data source changes. Examples:

- a remote system may produce occasional data files that are used in other
projects;
- a batch process running regularly updates a data file to import; and
- a shared dataset on a remote storage that is managed and updated outside DVC.
- A remote system may produce occasional data files that are used in other
projects.
- A batch process running regularly updates a data file to import.
- A shared dataset on a remote storage that is managed and updated outside DVC.

The `dvc import-url` command helps the user create such an external data
dependency. The `url` argument specifies the external location of the data to be
12 changes: 6 additions & 6 deletions static/docs/commands-reference/index.md
@@ -1,17 +1,17 @@
# Using DVC Commands

DVC is a command-line tool. The typical use case for DVC goes as follows
DVC is a command-line tool. The typical use case for DVC goes as follows:

- In an existing Git repository, initialize a DVC repository with `dvc init`,
- In an existing Git repository, initialize a DVC repository with `dvc init`.
- Copy source code files for modeling into the repository and convert the files
into DVC data files with `dvc add` command;
into DVC data files with `dvc add` command.
- Process raw data files through your data processing and modeling code using
the `dvc run` command;
the `dvc run` command.
- Use `--outs` option to specify `dvc run` command outputs which will be
converted to DVC data files after the code runs;
converted to DVC data files after the code runs.
- Clone a git repo with the code of your ML application pipeline. However, this
will not copy your DVC cache. Use
[data remotes](/doc/commands-reference/remote) and `dvc push` to share the
cache (data);
cache (data).
- Use `dvc repro` to quickly reproduce your pipeline on a new iteration, after
your data item files or source code of your ML application are modified.
4 changes: 2 additions & 2 deletions static/docs/commands-reference/install.md
@@ -46,9 +46,9 @@ The installed Git hook automates executing `dvc push`.
## Installed Git hooks

- Git `pre-commit` hook executes `dvc status` before `git commit` to inform the
user about the workspace status;
user about the workspace status.
- Git `post-checkout` hook executes `dvc checkout` after `git checkout` to
automatically synchronize the data files with the new workspace state;
automatically synchronize the data files with the new workspace state.
- Git `pre-push` hook executes `dvc push` before `git push` to upload files and
directories under DVC control to remote.

1 change: 0 additions & 1 deletion static/docs/commands-reference/pull.md
@@ -200,4 +200,3 @@ the `model.p.dvc` stage occurs later, its data was not pulled.
Then we ran `dvc pull` specifying the last stage, `model.p.dvc`, and its data
was downloaded. Finally, we ran `dvc pull` with no options to make sure that all
data was already pulled with the previous commands.

1 change: 0 additions & 1 deletion static/docs/commands-reference/push.md
@@ -339,4 +339,3 @@ Data and pipelines are up to date.

And running `dvc status --cloud` verifies that indeed there are no more files to
upload to the remote cache.

2 changes: 1 addition & 1 deletion static/docs/commands-reference/run.md
@@ -58,7 +58,7 @@ pipeline.
dependencies can be specified like this: `-d data.csv -d process.py`. Usually,
each dependency is a file or a directory with data, or a code file, or a
configuration file. DVC also supports certain
[external dependencies](/doc/user-guide/external-dependencies)
[external dependencies](/doc/user-guide/external-dependencies).

DVC builds a computation graph and this list of dependencies is a way to
connect different stages with each other. When you run `dvc repro` to
8 changes: 4 additions & 4 deletions static/docs/commands-reference/status.md
@@ -71,13 +71,13 @@ outputs described in it.
commands like `dvc commit` or `dvc repro`, `dvc run` should be run to update
the file. Possible states are:

- _new_: output exists in workspace, but there is no corresponding checksum
- _new_: Output exists in workspace, but there is no corresponding checksum
calculated and saved in the DVC-file for this output yet.
- _modified_: output or dependency exists in workspace, but the corresponding
- _modified_: Output or dependency exists in workspace, but the corresponding
checksum in the DVC-file is not up to date.
- _deleted_: output or dependency does not exist in workspace, but still
- _deleted_: Output or dependency does not exist in workspace, but still
referred in the DVC-file.
- _not in cache_: output exists in workspace and the corresponding checksum in
- _not in cache_: Output exists in workspace and the corresponding checksum in
the DVC-file is up to date, but there is no corresponding <abbr>cache</abbr>
entry.
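
The four states listed above can be read as a small decision procedure. The following is a minimal sketch, not DVC's actual implementation: the `State` enum, the `classify` helper, its parameters, and the order of the checks are all assumptions made for illustration, and the extra `UP_TO_DATE` value only covers the case where none of the listed states applies.

```python
from enum import Enum
from typing import Optional, Set


class State(Enum):
    NEW = "new"
    MODIFIED = "modified"
    DELETED = "deleted"
    NOT_IN_CACHE = "not in cache"
    UP_TO_DATE = "up to date"  # assumption: none of the listed states applies


def classify(
    workspace_checksum: Optional[str],
    recorded_checksum: Optional[str],
    cache: Set[str],
) -> State:
    """Hypothetical helper mapping one output to a state from the list above.

    workspace_checksum: checksum of the file in the workspace, None if missing.
    recorded_checksum: checksum saved in the DVC-file, None if not saved yet.
    cache: checksums that have a corresponding cache entry.
    """
    if workspace_checksum is None:
        # Missing from the workspace but still referred to in the DVC-file.
        return State.DELETED
    if recorded_checksum is None:
        # Exists in the workspace, no checksum saved in the DVC-file yet.
        return State.NEW
    if workspace_checksum != recorded_checksum:
        # Exists, but the saved checksum is not up to date.
        return State.MODIFIED
    if recorded_checksum not in cache:
        # Checksum is up to date, but there is no cache entry for it.
        return State.NOT_IN_CACHE
    return State.UP_TO_DATE
```

For example, `classify("abc", "abc", set())` falls through to the cache check and yields `State.NOT_IN_CACHE`, because the checksum matches but the (empty) cache holds no entry for it.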

13 changes: 7 additions & 6 deletions static/docs/commands-reference/version.md
@@ -61,11 +61,11 @@ The detail of `Binary` depends on the way DVC was downloading and
- **`Binary: True`** - displayed when DVC is downloaded/installed as one of:

- Debian package (`.deb`) - file used to install packages in several Linux
distributions, like Ubuntu.
distributions, like Ubuntu
- Red Hat package (`.rpm`) - file used to install packages in some Linux based
distributions, such as Fedora, CentOS, etc.
- PKG file (`.pkg`) - file used to install apps on macOS.
- Windows executable (`.exe`) - file used to install applications on Windows.
- PKG file (`.pkg`) - file used to install apps on macOS
- Windows executable (`.exe`) - file used to install applications on Windows

These downloads are available from our [home page](/). They ultimately contain
a binary bundle, which is the executable version of a software program,
@@ -76,11 +76,11 @@ The detail of `Binary` depends on the way DVC was downloading and
- **`Binary: False`** - shown when DVC is downloaded and installed from:

- [DVC's GitHub repository](https://github.com/iterative/dvc) - where core
source code is hosted.
source code is hosted
- [The Python Package Index (PyPI)](https://pypi.org/project/dvc/) - source
code is stored as a Python package.
code is stored as a Python package
- [Homebrew package manager](https://github.com/iterative/homebrew-dvc) (for
macOS systems) - source code is stored as Python package.
macOS systems) - source code is stored as Python package

This method of installation involves downloading DVC source code, and
following certain setup instructions (See the
@@ -125,3 +125,4 @@ Platform: Linux-4.15.0-50-generic-x86_64-with-debian-buster-sid
Binary: False
Filesystem type (workspace): ('ext4', '/dev/sdb3')
```

4 changes: 2 additions & 2 deletions static/docs/get-started/agenda.md
@@ -29,9 +29,9 @@ datasets and you want to:
- Capture and save those <abbr>data artifacts</abbr> the same way we capture
code
- Track and switch between different versions of the data easily
- Being able to answer the question of how data artifacts (e.g. ML models) were
- Be able to answer the question of how data artifacts (e.g. ML models) were
built in the first place
- Being able to compare them
- Be able to compare them
- Bring best practices to your team and get everyone on the same page

Then you are in a good place! Click the `Next` button below to start ↘
2 changes: 1 addition & 1 deletion static/docs/understanding-dvc/how-it-works.md
@@ -90,4 +90,4 @@
-r-------- 2 501 staff 273M Jan 27 03:48 Posts-test.tsv
```

8. DVC works on Mac, Linux ,and Windows.
8. DVC works on Mac, Linux, and Windows.
146 changes: 73 additions & 73 deletions static/docs/understanding-dvc/related-technologies.md
@@ -9,119 +9,119 @@ process.

1. **Git**. The difference is:

- DVC extends Git by introducing the concept of _data files_ - large files
that should NOT be stored in a Git repository but still need to be tracked
and versioned.
- DVC extends Git by introducing the concept of _data files_ large files
that should NOT be stored in a Git repository but still need to be tracked
and versioned.

2. **Workflow management tools** (pipelines and DAGs): Airflow, Luigi, etc. The
differences are:

- DVC is focused on data science and modeling. As a result, DVC pipelines are
lightweight, easy to create and modify. However, DVC lacks pipeline
execution features like execution monitoring, execution error handling, and
recovering.
- DVC is focused on data science and modeling. As a result, DVC pipelines are
lightweight, easy to create and modify. However, DVC lacks pipeline
execution features like execution monitoring, execution error handling, and
recovering.

- DVC is purely a command line tool without a graphical user interface (GUI)
and doesn't run any daemons or servers. Nevertheless, DVC can generate
images with pipeline and experiment workflow visualization.
- DVC is purely a command line tool without a graphical user interface (GUI)
and doesn't run any daemons or servers. Nevertheless, DVC can generate
images with pipeline and experiment workflow visualization.

3. **Experiment management software** today is mostly designed for enterprise
usage. An open-sourced experimentation tool example: http://studio.ml/. The
differences are:

- DVC uses Git as the underlying platform for experiment tracking instead of
a web application.
- DVC uses Git as the underlying platform for experiment tracking instead of
a web application.

- DVC doesn't need to run any services. No graphical user interface as a
result, but we expect some GUI services will be created on top of DVC.
- DVC doesn't need to run any services. No graphical user interface as a
result, but we expect some GUI services will be created on top of DVC.

- DVC has transparent design:
[meta files and directories](/doc/user-guide/dvc-files-and-directories)
(including the data cache) have a human-readable format and can be easily
reused by external tools.
- DVC has transparent design:
[meta files and directories](/doc/user-guide/dvc-files-and-directories)
(including the data cache) have a human-readable format and can be easily
reused by external tools.

4. **Git workflows** and Git usage methodologies such as Gitflow. The
differences are:

- DVC supports a new experimentation methodology that integrates easily with
a Git workflow. A separate branch should be created for each experiment,
with a subsequent merge of this branch if it was successful.
- DVC supports a new experimentation methodology that integrates easily with
a Git workflow. A separate branch should be created for each experiment,
with a subsequent merge of this branch if it was successful.

- DVC innovates by giving experimenters the ability to easily navigate
through past experiments without recomputing them.
- DVC innovates by giving experimenters the ability to easily navigate
through past experiments without recomputing them.

5) **Makefile** (and it's analogues). The differences are:

- DVC utilizes a DAG:
- DVC utilizes a DAG:

- The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with
file names `<file>.dvc` or `Dvcfile`).
- The DAG is defined by [DVC-files](/doc/user-guide/dvc-file-format) (with
file names `<file>.dvc` or `Dvcfile`).

- One DVC-file defines one node in the DAG. All DVC-files in a repository
make up a single pipeline (think a single Makefile). All DVC-files (and
corresponding pipeline commands) are implicitly combined through their
inputs and outputs, to simplify conflict resolving during merges.
- One DVC-file defines one node in the DAG. All DVC-files in a repository
make up a single pipeline (think a single Makefile). All DVC-files (and
corresponding pipeline commands) are implicitly combined through their
inputs and outputs, to simplify conflict resolving during merges.

- DVC provides a simple command `dvc run CMD` to generate a DVC-file
automatically based on the provided command, dependencies, and outputs.
- DVC provides a simple command `dvc run CMD` to generate a DVC-file
automatically based on the provided command, dependencies, and outputs.

- File tracking:
- File tracking:

- DVC tracks files based on checksum (md5) instead of file timestamps. This
helps avoid running into heavy processes like model re-training when you
checkout a previous, trained version of a modeling code (Makefile will
retrain the model).
- DVC tracks files based on checksum (md5) instead of file timestamps. This
helps avoid running into heavy processes like model re-training when you
checkout a previous, trained version of a modeling code (Makefile will
retrain the model).

- DVC uses file timestamps and inodes for optimization. This allows DVC to
avoid recomputing all dependency files checksum, which would be highly
problematic when working with large files (10 GB+).
- DVC uses file timestamps and inodes for optimization. This allows DVC to
avoid recomputing all dependency files checksum, which would be highly
problematic when working with large files (10 GB+).

6. **Git-annex**. The differences are:

- DVC uses the idea of storing the content of large files (that you don't
want to see in your Git repository) in a local key-value store and use file
symlinks instead of the actual files.
- DVC uses the idea of storing the content of large files (that you don't
want to see in your Git repository) in a local key-value store and use file
symlinks instead of the actual files.

- DVC can use reflinks\* or hardlinks (depending on the system) instead of
symlinks to improve performance and make the user experience better.
- DVC can use reflinks\* or hardlinks (depending on the system) instead of
symlinks to improve performance and make the user experience better.

- DVC optimizes checksum calculation.
- DVC optimizes checksum calculation.

- Git-annex is a datafile-centric system whereas DVC is focused on providing
a workflow for machine learning and reproducible experiments. When a DVC or
Git-annex repository is cloned via git clone, data files won't be copied to
the local machine as file content is stored in separate data remotes.
However, [DVC-files](/doc/user-guide/dvc-file-format) (which provide the
reproducible workflow) are always included in the cloned Git repository and
hence can be recreated locally with minimal effort.
- Git-annex is a datafile-centric system whereas DVC is focused on providing
a workflow for machine learning and reproducible experiments. When a DVC or
Git-annex repository is cloned via git clone, data files won't be copied to
the local machine as file content is stored in separate data remotes.
However, [DVC-files](/doc/user-guide/dvc-file-format) (which provide the
reproducible workflow) are always included in the cloned Git repository and
hence can be recreated locally with minimal effort.

- DVC is not fundamentally bound to Git, having the option of changing the
repository format.
- DVC is not fundamentally bound to Git, having the option of changing the
repository format.

7) **Git-LFS** (Large File Storage). The differences are:

- DVC does not require special Git servers like Git-LFS demands. Any cloud
storage like S3, GCS, or on-premises SSH server can be used as a backend
for datasets and models, no additional databases, servers or infrastructure
are required.
- DVC does not require special Git servers like Git-LFS demands. Any cloud
storage like S3, GCS, or on-premises SSH server can be used as a backend
for datasets and models, no additional databases, servers or infrastructure
are required.

- DVC is not fundamentally bound to Git, having the option of changing the
repository format.
- DVC is not fundamentally bound to Git, having the option of changing the
repository format.

- DVC does not add any hooks to Git by default. To checkout data files, the
`dvc checkout` command has to be run after each `git checkout` and
`git clone` command. It gives more granularity on managing data and code
separately. Hooks could be configured to make workflow simpler.
- DVC does not add any hooks to Git by default. To checkout data files, the
`dvc checkout` command has to be run after each `git checkout` and
`git clone` command. It gives more granularity on managing data and code
separately. Hooks could be configured to make workflow simpler.

- DVC attempts to use reflinks\* and has other
[file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).
This way the `dvc checkout` command does not actually copy data files from
cache to the workspace, as copying files is a heavy operation for large
files (30 GB+).
- DVC attempts to use reflinks\* and has other
[file linking options](/docs/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).
This way the `dvc checkout` command does not actually copy data files from
cache to the workspace, as copying files is a heavy operation for large
files (30 GB+).

- `git-lfs` was not made with data science scenarios in mind, so it does not
provide related features (e.g. pipelines, metrics), and thus Github has a
limit of 2 GB per repository.
- `git-lfs` was not made with data science scenarios in mind, so it does not
provide related features (e.g. pipelines, metrics), and thus Github has a
limit of 2 GB per repository.
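
The file-tracking point in the Makefile comparison above (checksums decide whether a file changed, while timestamps and inodes serve only as an optimization) can be illustrated with a short sketch. This is not DVC's code: `file_md5`, `cached_md5`, `has_changed`, and the in-memory `_state` dictionary are hypothetical names, and a real tool would presumably persist this state on disk.

```python
import hashlib
import os
from typing import Dict, Tuple

# Hypothetical in-memory state: path -> ((inode, mtime_ns), md5 hex digest).
_state: Dict[str, Tuple[Tuple[int, int], str]] = {}


def file_md5(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the md5 of a file in chunks, so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_md5(path: str) -> str:
    """Return the md5, recomputing only if the inode or timestamp changed."""
    st = os.stat(path)
    signature = (st.st_ino, st.st_mtime_ns)
    entry = _state.get(path)
    if entry is not None and entry[0] == signature:
        # Same inode and timestamp: reuse the stored checksum.
        return entry[1]
    checksum = file_md5(path)
    _state[path] = (signature, checksum)
    return checksum


def has_changed(path: str, recorded_md5: str) -> bool:
    """A file counts as changed only if its content checksum differs."""
    return cached_md5(path) != recorded_md5
```

Under this scheme, touching a file (a new timestamp with identical content) triggers one checksum recomputation but is not reported as a change, whereas a purely timestamp-based tool such as Make would rebuild.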

---
