cases: Versioning Data and Models (rewrite) #1747

Merged
shcheklein merged 119 commits into master from use-cases-versng
Dec 6, 2020

@jorgeorpinel (Contributor) commented Sep 1, 2020

Continuation of #1716 (comment)

UPDATE: Pending stuff

explain why versioning large files is important/a thing
per #1716 (comment)

@shcheklein (Contributor)

It's good in the sense that we are starting it from scratch 👍

Regarding the intro: it still does not give any sense of why versioning data and models is important (reference to a problem), how DVC solves it, and why DVC is the way to go (selling the philosophy).

@jorgeorpinel changed the title from "[WIP] cases: Versioning Data and Model Files (rewrite)" to "cases: Versioning Data and Model Files (rewrite)" on Sep 3, 2020

@dmpetrov (Contributor) left a comment

It is a well-written use case. A few suggestions are inline.

Comment thread content/docs/use-cases/versioning-data-and-model-files/index.md
- **Tidy project**: Work with a natural file structure. No need for ad hoc
naming conventions like `data/20190922/labels_v7_final`.
- **Reproducibility**: Restore any project version and find the corresponding

Contributor

A bit too abstract. I'd suggest moving it down in the list.

Contributor Author

OK, I moved it down for now.

But I thought this was a pretty key purpose of versioning in general. Maybe we can rewrite it in less conceptual terms, but I'm not sure how.

Contributor

How can we make it less abstract?

It's an important point to my mind, even more important than Tidy project (now Simple names).

@jorgeorpinel (Contributor Author) commented Nov 24, 2020

OK, I put it on the top 😆 for now. But we should come up with criteria to decide which ones go first... after we decide which benefits to even keep (probably too many right now).

Contributor

@dmpetrov's point stands: if it feels abstract, should we reconsider the description? Make the bold label "Reproducibility and something"?

Contributor Author

This is back around the middle, and the other benefits have been generalized more, so maybe now they all have a consistent level of abstraction?

e.g. [CML](https://cml.dev/)), specialized patterns like
[data registries](/doc/use-cases/data-registries), and other best practices
that improve productivity and scalability.
- **Data security**: The project history (in Git) is immutable, which allows for

Contributor

I'd separate this into two:

  • Data security: set up security in the underlying storage, like S3 or NFS...
  • Data compliance: audit the processes that change data and models using regular GitFlow and pull requests. Get visibility into data and ML model modifications using regular Git: who modified a dataset, when it happened, and who approved the change.

Contributor Author

I agree that compliance is a better term, thanks! Updated along the lines you suggested.

Data security: maybe we don't need this benefit? Especially if we decide not to cover storage in this doc (per #1747 (comment) above). I removed it for now.

Contributor

Data compliance should probably be part of the Git flow? Or can Git flow be replaced by it?

Contributor Author

I generalized all benefits into 5 now, PTAL. But the problem with Git workflows is that they have too many sub-benefits, and it would be a huge paragraph if we included everything. I think Data compliance is a high-level benefit of its own anyway?

We're even missing data security but per the comments above it sounds like it would be better to list it in https://dvc.org/doc/use-cases/sharing-data-and-model-files (as it has more to do with remote storage).

@jorgeorpinel (Contributor Author) left a comment

Another iteration of this is ready for review. Here are 2 known (minor) issues I see:

Comment thread content/docs/use-cases/versioning-data-and-model-files/index.md Outdated
Comment thread content/docs/use-cases/versioning-data-and-model-files/index.md
complicated paths like `data/20190922/labels_v7_final` or for constantly
editing these in source code.

- **Efficient data management**: Use a familiar and cost-effective storage

Contributor

A final ask in private to update this a bit

@shcheklein (Contributor) left a comment

Cool 😎 stuff!

@jeremydesroches (Contributor)

@jorgeorpinel would you still like me to review this for SEO-related updates, as you mentioned before?

@jorgeorpinel (Contributor Author)

That would be useful, @jeremydesroches, but it's completely up to you. Also, at this point, since the rewrite was merged recently, do you think we should wait a few more weeks to be able to compare search patterns vs. the previous version?

I guess giving it a read to make sure the key terms you added in https://github.com/iterative/dvc.org/pull/1806/files#diff-34b5f307478d996f5f763b6e71d8e9689543d9ef99f85d36a505c8ef47e57fee are still there (if possible) would be ideal.

Labels

A: docs (Area: user documentation, gatsby-theme-iterative); C: cases (Content of /doc/use-cases); p1-important (Active priorities to deal within next sprints)

5 participants