cases: Versioning Data and Models (rewrite) #1747
b7378a9 to df7a5f6
explain why versioning large files is important/a thing per #1716 (comment)
df7a5f6 to 87264eb
d400921 to 87264eb
It's good in the sense that we are starting from scratch 👍 Regarding the intro: it still does not give any sense of why versioning data and models is important (reference to a problem), how DVC solves it, and why DVC is the way to go (selling the philosophy).
6421771 to 8bdae1d
is important, and how it looks with DVC per #1747 (comment)
dmpetrov
left a comment
It is a well-written use case. A few suggestions are inline.
- **Tidy project**: Work with a natural file structure. No need for ad hoc
  naming conventions like `data/20190922/labels_v7_final`.
- **Reproducibility**: Restore any project version and find the corresponding
A bit too abstract. I'd suggest moving it down in the list.
OK, I moved it down for now.
But I thought this was a pretty key purpose of versioning in general. Maybe we can rewrite it in less conceptual terms, but I'm not sure how.
How can we make it less abstract?
It's an important point to my mind (even more important than a Tidy project (Simple names now)).
OK, I put it on top 😆 for now. But we should come up with criteria to decide which ones go first... after we decide which benefits to even keep (probably too many right now).
@dmpetrov's point stands: if it feels abstract, should we reconsider the description? Make Reproducibility and something in bold?
This is back around the middle, and the other benefits have been generalized more, so maybe now they all have a consistent level of abstraction?
  e.g. [CML](https://cml.dev/)), specialized patterns like
  [data registries](/doc/use-cases/data-registries), and other best practices
  that improve productivity and scalability.
- **Data security**: The project history (in Git) is immutable, which allows for
I'd separate this into two:
- Data security - set up security in underlying storage like S3 or NFS....
- Data compliance - audit changes to data and models using the regular Git flow and pull requests. Visibility into data and ML model modifications using regular Git: who modified a dataset, when it happened, and who approved the change.
I agree that compliance is a better term, thanks! Updated along the lines you suggested.
Data security: maybe we don't need this benefit? Especially if we decide not to cover storage in this doc (per #1747 (comment) above). I removed it for now.
Data compliance should probably be part of the Git flow? Or can Git flow be replaced by it?
I generalized all benefits into 5 now, PTAL. But the problem with Git workflows is that it has too many sub-benefits, and it would be a huge paragraph if we included everything. I think Data compliance is a high-level benefit of its own anyway?
We're even missing data security but per the comments above it sounds like it would be better to list it in https://dvc.org/doc/use-cases/sharing-data-and-model-files (as it has more to do with remote storage).
and update benefits
jorgeorpinel
left a comment
Another iteration of this is ready for review. Here are 2 known (minor) issues I see:
  complicated paths like `data/20190922/labels_v7_final` or for constantly
  editing these in source code.

- **Efficient data management**: Use a familiar and cost-effective storage
A final ask in private to update this a bit
@jorgeorpinel would you still like me to review this for SEO-related updates, as you mentioned before?
That would be useful @jeremydesroches, but it's completely up to you. Also, since the rewrite was merged recently, do you think we should wait a few more weeks to be able to compare search patterns against the previous version? I guess giving it a read to make sure the key terms you added in https://github.com/iterative/dvc.org/pull/1806/files#diff-34b5f307478d996f5f763b6e71d8e9689543d9ef99f85d36a505c8ef47e57fee are still there (if possible) would be ideal.
UPDATE: Pending stuff