cases: Versioning Data and Models (rewrite) #1747
Merged
119 commits
39d4400
Merge branch 'master' into use-cases
jorgeorpinel 87264eb
cases: [WIP] befin rewriting Versioning:
jorgeorpinel 8bdae1d
cases: give some sense of why versioning data and models is important
jorgeorpinel 1679c59
guide: why DVC is the way to Version data (sell philosophy)
jorgeorpinel a794600
Merge branch 'master' into use-cases-versng
jorgeorpinel ab35693
cases: add example section explaining why data versionig is
jorgeorpinel 5ed0d75
cases: wrap up Versioning full draft
jorgeorpinel c291558
cases: rename demo section in Versioning, roll back checkout img, et al.
jorgeorpinel a3008c8
Merge branch 'master' into use-cases-versng
jorgeorpinel ca062b2
Merge branch 'master' into use-cases-versng
jorgeorpinel 7a4cec6
Merge branch 'master' into use-cases-versng
jorgeorpinel 9e564e3
cases: some more versioning updates
jorgeorpinel 9d5dca1
cases: shorten versioning intro
jorgeorpinel c188146
cases: add bullet list of Versioning advantages
jorgeorpinel 1f4f2f0
cases: shorten Why DVC section in Versioning
jorgeorpinel c554b67
Merge branch 'master' into use-cases-versng
jorgeorpinel de07dfa
term: data modeling -> data engineering
jorgeorpinel daa99a0
cases: make advantages section in Data Registry (consistency)
jorgeorpinel 02568e7
cases: make separate Versioned storage section
jorgeorpinel 79d9b0b
cases: rewrite intro and other changes to Versioning
jorgeorpinel 8ddc976
cases: cover gap between Versioning and (remote) storage, link to GS
jorgeorpinel ae8e7ad
Merge branch 'master' into use-cases-versng
jorgeorpinel 09d1eab
use-cases: reapply SEO keyword changes from #1806
jorgeorpinel 7ea83bf
cases: make p about storage less overlapping to previous one
jorgeorpinel e90d332
cases: add paragraph about versioning advantages before DVC's motivation
jorgeorpinel 5f47377
cases: simplify lists of advantages in Versioning (and Data Reg)
jorgeorpinel f36d10b
cases: limitation->constraint (to avoid a redundancy)
jorgeorpinel b9ea7ec
guide: move DVC is not Git! from use cases to What is DVC?
jorgeorpinel 3fdddf2
cases: ~~Summary of~~ Advantages (H2)
jorgeorpinel ce647f5
cases: rewrite parts of the DVC motivation paragraphs in Versioning
jorgeorpinel bc7018b
Merge branch 'master' into use-cases-versng
jorgeorpinel 62a34db
cases: improve vrsng intro and dedupe bullet lists
jorgeorpinel 81d848b
cases: rename Advantages sectino of vrsng
jorgeorpinel 1898ccf
cases: expand on How it looks (vrsng) with focus on workspace
jorgeorpinel c6d969c
Merge branch 'master' into use-cases-versng
jorgeorpinel f447b14
guide: improve DVC is not Git! section
jorgeorpinel 384218d
cases: rename Versioning use case (why "Files"?)
jorgeorpinel 7facdf7
cases: rewrite (again) the intro to vrsng
jorgeorpinel 8f3cb70
Merge branch 'master' into use-cases-versng
jorgeorpinel c0b6b58
cases: improve versioning intro (more coherent)
jorgeorpinel 17f80b8
cmd: quick term update
jorgeorpinel 018c3f3
Merge branch 'master' into use-cases-versng
jorgeorpinel b63bc7f
cases: update links to Versioning use case
jorgeorpinel 0ec7350
cases: refine Versioning intro, add proposed figure
jorgeorpinel 68619f1
cases: summarize, simplify, focus on the essence, et al.
jorgeorpinel 12bc7ed
cases: add redirect for new Versioning use case location
jorgeorpinel 78426d1
cases: merge How it looks + Version control sections
jorgeorpinel bc7ff0a
cases: simplify versioning-data-and-models#how-it-looks
jorgeorpinel 53b65c2
Merge branch 'master' into use-cases-versng
jorgeorpinel 579cc5b
Revert "redirect for new Versioning use case URL" 12bc7ed and
jorgeorpinel fb7265c
cases: rewrite intro to improve motivation and
jorgeorpinel c9d0444
cases: update Why DVC and benefits list
jorgeorpinel f49cce6
cases: actually revert URL change from recent commit
jorgeorpinel 5b95d36
cases: more updates to the benefits bullets in Versioning
jorgeorpinel aeb860e
cases: rewrite How it looks (& feels) section
jorgeorpinel 2c7e2ea
cases: remove non-essential info. from How it looks section of Versio…
jorgeorpinel da9390a
Merge branch 'master' into use-cases-versng
jorgeorpinel 30ad4e7
cases: simplify How it looks per David and some of Ivan's feedback
jorgeorpinel c5e34ce
cases: remove H2s temporarily, simplify benefits bullet list, et al.
jorgeorpinel 04e42cb
cses: rewrite benefit bullets and simplify how it feels section
jorgeorpinel 531071a
cases: make bullet list into paragraph temp.
jorgeorpinel 40f09df
cases: wrap up Vrsng? (text)
jorgeorpinel 8fcd2e6
cases: hardcode colums in How it feels section of Vrsng
jorgeorpinel 669f9f5
cache: simplify it's structure explanation and add CAS term (from Vrs…
jorgeorpinel 67c6beb
guide: revert changes to this section for now
jorgeorpinel 4329b60
cases: polish latest iteration of Versioning use case
jorgeorpinel 8ca6ef1
Merge branch 'master' into use-cases-versng
jorgeorpinel 8b81ac3
Merge branch 'master' into use-cases-versng
jorgeorpinel 66b0829
cases: next iteration of Versioning page
jorgeorpinel 49adf55
Merge branch 'master' into use-cases-versng
jorgeorpinel aa6c43e
cases: polishing my last iteration of the Vsng page
jorgeorpinel 3c61ea7
remove a bunch of info from Vrsng to simplify again
jorgeorpinel b74e687
cases: minor iteration of Vrsng, pending benefits list
jorgeorpinel f02c1a7
guide: updates to What is DVC
jorgeorpinel 63970bc
cmd: roll-back unrelated changes (stashed elsewhere for now)
jorgeorpinel e6ce632
cases: work on benefits of Vrsng
jorgeorpinel 88ff11a
cases: more work on benefits of Vrsng
jorgeorpinel aeacb1f
cases: remove emojis; improve benefits list; add refs to other cases
jorgeorpinel 3956590
cses: clarify about cache and about metafiles in Versioning
jorgeorpinel eeccb68
cases: simplify p about roll back/fwds; split benefit about data regs
jorgeorpinel 7c22613
Merge branch 'master' into use-cases-versng
jorgeorpinel 00b88e1
cases: change BEFORE to be similar to the top fig.
jorgeorpinel 4436600
Merge branch 'master' into use-cases-versng
jorgeorpinel 702d619
cases: another iteration of Versioning
jorgeorpinel 00dc2d6
cases: simplify Versioning again
jorgeorpinel 5edf502
cases: improvements on Vrsng per direct feedback
jorgeorpinel 21fce9b
cases: more updates to latest text and figures
jorgeorpinel c0142c2
Merge branch 'master' into use-cases-versng
jorgeorpinel 6cab1f3
cases: rephrase Vrsng benefits list
jorgeorpinel 24a00cd
Merge branch 'master' into use-cases-versng
jorgeorpinel b1d75c7
cases: revert to previous draft fig
jorgeorpinel 8307583
cases: update 2nd figure draft, and reorder codification p
jorgeorpinel 79a071a
cases: rework Vrsng benefits and
jorgeorpinel 953c16b
cases: draft What's Next section added with advanced scenarios for Vrsng
jorgeorpinel 52ea945
cases: simplify 2nd figure
jorgeorpinel 2b0d183
Merge branch 'master' into use-cases-versng
jorgeorpinel 98b3135
cases: make first Vrsg figure shorter
jorgeorpinel 87b8333
cases: merge advanced scenarios with benefits list
jorgeorpinel 59fc3f9
cases: roll back changes to Data Regs
jorgeorpinel 7677d25
cases: improvements per Dmitry's feedback...
jorgeorpinel a6f3352
cases: train_feats > features in figures for Vrsng
jorgeorpinel 9644a38
cases: rename Vrng Tutorial label in nav (use emoji)
jorgeorpinel 0e47977
cases: explain simple file naming a bit more
jorgeorpinel 8bd9d96
cases: Vrng copy edits
jorgeorpinel 1686639
cases: add efficient data mgmt benefit
jorgeorpinel 4ab5ed4
cases: reorder Vrsg benefits list
jorgeorpinel 1b48db2
cases: rewrite file naming and data mgmt benefits of Vrsg
jorgeorpinel 0fbf200
cases: expand story to cover storage and data management
jorgeorpinel c7617ef
cases: generalized Vrsg benefits
jorgeorpinel aa62d9d
Merge branch 'master' into use-cases-versng
jorgeorpinel c14a271
cases: separate data mgmt from versioning (through codification) in Vrsg
jorgeorpinel 020fb3f
Make note about other guides, refs, and tutorial (Vrng)
jorgeorpinel 1a3a58c
cases: emphasize Simplicity benefit of Vrng is the opposite of "compl…
jorgeorpinel f6598bd
cases: another rewrite of text and benefits
jorgeorpinel 98d97c7
Merge branch 'master' into use-cases-versng
jorgeorpinel 6a75449
cases: copy edits to latest Vrng iteration, and append next steps par…
jorgeorpinel d6697dc
Merge branch 'master' into use-cases-versng
jorgeorpinel 2ccd32e
cases: another iteration of Versioning use case
jorgeorpinel 0d093b8
cases: clarify data mgmt is for data in Vrng benefits
jorgeorpinel
218 changes: 88 additions & 130 deletions
content/docs/use-cases/versioning-data-and-model-files/index.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,130 +1,88 @@ | ||
| # Versioning Data and Model Files | ||
|
|
||
| DVC enables versioning large files and directories such as datasets, data | ||
| science features, and machine learning models using Git, but without storing the | ||
| contents in Git. | ||
|
|
||
| This is achieved by saving information about the data in special | ||
| [metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in | ||
| the repository. These can be versioned with regular Git workflows (branches, | ||
| pull requests, etc.) | ||
|
|
||
| To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports | ||
| synchronizing it with various types of | ||
| [remote storage](/doc/command-reference/remote). This allows for easy data and | ||
| model versioning, storage, and sharing — right alongside code. | ||
|
|
||
|  _Code and data flows in DVC_ | ||
|
|
||
| In this basic use case, DVC is a better alternative to | ||
| [Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc | ||
| scripts used to manage ML <abbr>artifacts</abbr> (training data, models, etc.) | ||
| on cloud storage. DVC doesn't require special services, and works with | ||
| on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider | ||
| (Amazon S3, Microsoft Azure, Google Drive, | ||
| [among others](/doc/command-reference/remote/add#supported-storage-types)). | ||
|
|
||
| > For hands-on experience, we recommend following the | ||
| > [versioning tutorial](/doc/use-cases/versioning-data-and-model-files). | ||
|
|
||
| ## DVC is not Git! | ||
|
|
||
| DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track | ||
| data files and directories for versioning (among other purposes). They point to | ||
| specific data contents in the <abbr>cache</abbr>, providing the ability to store | ||
| multiple data versions out-of-the-box. | ||
|
|
||
| Full-fledged | ||
| [version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) | ||
| is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. | ||
| However, these are designed for source code management (SCM), and are thus | ||
| ill-equipped to support data science needs. That's where DVC comes in, with its | ||
| built-in data <abbr>cache</abbr>, reproducible [pipelines](/doc/start/data-pipelines), | ||
| and several other novel features (see [Get Started](/doc/start/) for a primer). | ||
|
|
||
| ## Track data and models for versioning | ||
|
|
||
| Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of | ||
| images in the `images/` directory. You can start tracking it with `dvc add`. | ||
| This generates a `.dvc` file, which can be committed to Git in order to save the | ||
| project's version: | ||
|
|
||
| ```dvc | ||
| $ ls images/ | ||
| 0001.jpg 0002.jpg 0003.jpg 0004.jpg ... | ||
|
|
||
| $ dvc add images/ | ||
|
|
||
| $ git add images.dvc .gitignore | ||
| $ git commit -m "Track images dataset with DVC." | ||
| ``` | ||
|
|
||
| DVC also lets you define the processes that build artifacts based on tracked | ||
| data, such as an ML model, by writing a simple `dvc.yaml` file that connects the | ||
| pieces together: | ||
|
|
||
| > `dvc.yaml` files can be written manually or generated with `dvc run`. | ||
|
|
||
| ```yaml | ||
| stages: | ||
| train: | ||
| cmd: python train.py images/ | ||
| deps: | ||
| - images | ||
| outs: | ||
| - model.pkl | ||
| ``` | ||
|
|
||
| > See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to | ||
| > this feature. | ||
|
|
||
| `dvc repro` can now execute the `train` stage for you. DVC will track all of its | ||
| outputs (`outs`) automatically. Let's do that, and commit this project version: | ||
|
|
||
| ```dvc | ||
| $ dvc repro | ||
| Running stage 'train' with command: | ||
| python train.py images/ | ||
| Updating lock file 'dvc.lock' | ||
| ... | ||
|
|
||
| $ git add dvc.yaml dvc.lock .gitignore | ||
| $ git commit -m "Train model via DVC." | ||
| $ git tag -a "v1.0" -m "First model" # We'll use this soon ;) | ||
| ``` | ||
|
|
||
| > See also `dvc.lock`. | ||
|
|
||
| ## Switching versions | ||
|
|
||
| After iterating on this process and producing several versions, you can combine | ||
| `git checkout` and `dvc checkout` to perform full or partial | ||
| <abbr>workspace</abbr> restorations. | ||
|
|
||
|  _Code and data checkout_ | ||
|
|
||
| > Note that `dvc install` enables auto-checkouts of data after `git checkout`. | ||
|
|
||
| A full checkout brings the whole <abbr>project</abbr> back to a previous version | ||
| — code, dataset and model files all match each other: | ||
|
|
||
| ```dvc | ||
| $ git checkout v1.0 | ||
| $ dvc checkout | ||
| M images | ||
| M model.pkl | ||
| ``` | ||
|
|
||
| However, we can check out only certain parts, for example if we want to keep the | ||
| latest source code and model versions, but rewind to the previous version of the | ||
| dataset: | ||
|
|
||
| ```dvc | ||
| $ git checkout v1.0 images.dvc | ||
| $ dvc checkout images.dvc | ||
| M images | ||
| ``` | ||
|
|
||
| DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by | ||
| avoiding copying files each time, so checking out data is quick even if you are | ||
| versioning large data files. | ||
| # Versioning Data and Models | ||
|
|
||
| Data science teams face data management questions around versions of data and | ||
| machine learning models. How do we keep track of changes in data, source code, | ||
| and ML models together? What's the best way to organize and store variations of | ||
| these files and directories? | ||
|
|
||
|  _Exponential complexity of data science projects_ | ||
|
|
||
| Another problem in the field has to do with bookkeeping: being able to identify | ||
| past data inputs and processes to understand their results, for knowledge | ||
| sharing, or for debugging. | ||
|
|
||
| **Data Version Control** (DVC) lets you capture the versions of your data and | ||
| models in | ||
| [Git commits](https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository), | ||
| while storing them on-premises or in cloud storage. It also provides a mechanism | ||
| to switch between these different data contents. The result is a single history | ||
| for data, code, and ML models that you can traverse — a proper journal of your | ||
| work! | ||
|
|
||
|  _DVC matches the right versions of data, code, | ||
| and models for you 💘._ | ||
|
|
||
| DVC enables data _versioning through codification_. You write simple | ||
| [metafiles](/doc/user-guide/dvc-files-and-directories) once, describing what | ||
| datasets, ML artifacts, etc. to track. This metadata can be put in Git in lieu | ||
| of large files. Now you can use DVC to create | ||
| [snapshots](/doc/command-reference/add) of the data, | ||
| [restore](/doc/command-reference/checkout) previous versions, | ||
| [reproduce](/doc/command-reference/repro) experiments, record evolving | ||
| [metrics](/doc/command-reference/metrics), and more! | ||
|
|
||
| 👩‍💻 **Intrigued?** Try our | ||
| [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial) | ||
| to learn how DVC looks and feels firsthand. | ||
|
|
||
| As you use DVC, unique versions of your data files and directories are | ||
| [cached](dvc-files-and-directories#structure-of-the-cache-directory) in a | ||
| systematic way (preventing file duplication). The working datastore is separated | ||
| from your <abbr>workspace</abbr> to keep the project light, but stays connected | ||
| via file | ||
| [links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) | ||
| handled automatically by DVC. | ||
|
|
||
| Benefits of our approach include: | ||
|
|
||
| - **Lightweight**: DVC is a | ||
| [free](https://github.com/iterative/dvc/blob/master/LICENSE), open-source | ||
| [command line](/doc/command-reference) tool that doesn't require databases, | ||
| servers, or any other special services. | ||
|
|
||
| - **Consistency**: Keep your projects readable with stable file names: they | ||
| don't need to change just because the data they represent changes. No need for | ||
| complicated paths like `data/20190922/labels_v7_final` or for constantly | ||
| editing these in source code. | ||
|
|
||
| - **Efficient data management**: Use a familiar and cost-effective storage | ||
|
Contributor review comment: A final ask in private to update this a bit |
||
| solution for your data and models (e.g. SFTP, S3, HDFS, | ||
| [etc.](/doc/command-reference/remote/add#supported-storage-types)) — free from | ||
| Git hosting | ||
| [constraints](https://docs.github.com/en/free-pro-team@latest/github/managing-large-files/what-is-my-disk-quota). | ||
| DVC [optimizes](/doc/user-guide/large-dataset-optimization) storing and | ||
| transferring large files. | ||
|
|
||
| - **Collaboration**: Easily distribute your project development and share its | ||
| data [internally](/doc/use-cases/shared-development-server) and | ||
| [remotely](/doc/use-cases/sharing-data-and-model-files), or | ||
| [reuse](/doc/start/data-access) it in other places. | ||
|
|
||
| - **Data compliance**: Review data modification attempts as Git | ||
| [pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/). | ||
| Audit the project's immutable history to learn when datasets or models were | ||
| approved, and why. | ||
|
|
||
| - **GitOps**: Connect your data science projects with the Git-powered universe. | ||
| Git workflows open the door to advanced tools such as continuous integration | ||
| (like [CML](https://cml.dev/) CI/CD), specialized patterns such as | ||
| [data registries](/doc/use-cases/data-registries), and other best practices. | ||
|
jorgeorpinel marked this conversation as resolved.
|
||
|
|
||
| In summary, data science and ML are iterative processes where the lifecycles of | ||
| data, models, and code happen at different paces. DVC helps you manage and | ||
| enforce them. | ||
|
|
||
| And this is just the beginning. DVC supports multiple advanced features | ||
| out-of-the-box: build, run, and version | ||
| [data pipelines](/doc/command-reference/dag), | ||
| [manage experiments](/doc/start/experiments) effectively, and more. | ||
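
The caching paragraph in the rewritten text (unique versions of files cached in a systematic way, with the workspace connected through file links) can be illustrated with a minimal sketch. This is not DVC's implementation: the `cache_add` and `checkout` helpers, the MD5-keyed two-level directory layout, and the hard-link-with-copy-fallback strategy are assumptions chosen only to show the idea of a content-addressed store.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def cache_add(cache_dir: Path, data_file: Path) -> str:
    """Copy a file into the cache under the MD5 hash of its contents.

    Returns the hash, which plays the role of the pointer a metafile would hold.
    (Hypothetical helper, for illustration only.)
    """
    contents = data_file.read_bytes()
    digest = hashlib.md5(contents).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():  # identical contents are stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(contents)
    return digest


def checkout(cache_dir: Path, digest: str, workspace_file: Path) -> None:
    """Make a workspace path show the cached contents identified by `digest`.

    Tries a hard link first (no data is copied) and falls back to a plain copy
    on file systems that do not support hard links.
    """
    source = cache_dir / digest[:2] / digest[2:]
    if workspace_file.exists():
        workspace_file.unlink()
    try:
        os.link(source, workspace_file)
    except OSError:
        shutil.copyfile(source, workspace_file)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache, data = root / "cache", root / "data.csv"

        data.write_text("a,b\n1,2\n")
        v1 = cache_add(cache, data)   # snapshot of the first version
        data.write_text("a,b\n3,4\n")
        v2 = cache_add(cache, data)   # snapshot of the second version

        checkout(cache, v1, data)     # same file name, older contents
        print(data.read_text())
        # Only read after checkout: writing through a hard link would also
        # change the cached copy in this simplified sketch.
```

The file name `data.csv` never changes; only the hash recorded for it does, which is the property the **Consistency** benefit in the diff describes.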