Shared cache on NFS Introduced #455

Closed
ryokugyu wants to merge 13 commits into treeverse:master from ryokugyu:shared-cache

Conversation

Contributor

@ryokugyu ryokugyu commented Jun 25, 2019

fix #103

@shcheklein shcheklein temporarily deployed to dvc-org-pr-455 June 26, 2019 23:19 Inactive
Comment thread static/docs/use-cases/shared-storage-on-nfs.md
link data files from cache to your workspace. It enables symlinks to avoid
copying large files.

`cache.protected true` - to make links `read only` so that we you don't corrupt
Contributor

Again, explain this better: since we are going to use symlinks between the cache and the workspace in this case (they are located on different file systems), it's important to protect the files so that we don't corrupt the cache accidentally. Mention that `dvc unprotect` should be used in this case, and link to https://dvc.org/doc/user-guide/update-tracked-file

Contributor Author

@ryokugyu ryokugyu Jul 22, 2019

@shcheklein do we need `dvc unprotect` only when we are writing to NFS directly?

Contributor

We need to run `dvc unprotect` in the client's workspace if we want to edit or rewrite a file that is under DVC control.
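
The mechanism discussed in this thread can be illustrated with a minimal sketch that uses only standard shell tools, not DVC itself. The directory names are hypothetical stand-ins for the NFS-mounted cache and a client workspace, and the manual copy step only approximates what `dvc unprotect` is meant to do.

```shell
# Hypothetical stand-ins for the NFS-mounted cache and a client workspace.
demo=$(mktemp -d)
cache="$demo/nfs-cache"
workspace="$demo/workspace"
mkdir -p "$cache" "$workspace"

# A cached file, made read-only (the effect cache.protected is meant to have).
printf 'v1\n' > "$cache/data.csv"
chmod a-w "$cache/data.csv"

# The workspace holds a symlink to the cached file instead of a copy.
ln -s "$cache/data.csv" "$workspace/data.csv"

# Before editing, swap the link for a writable copy so the cache stays intact.
# With DVC this is the job of `dvc unprotect data.csv`.
cp "$workspace/data.csv" "$workspace/data.csv.copy"   # cp follows the symlink
mv "$workspace/data.csv.copy" "$workspace/data.csv"   # replaces the link itself
chmod u+w "$workspace/data.csv"

# The edit now lands in the workspace copy only.
printf 'v2\n' >> "$workspace/data.csv"
```

Writing through the symlink instead would modify the cached file that every other workspace links to, which is the corruption the read-only protection is there to prevent.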

Comment thread static/docs/use-cases/shared-storage-on-nfs.md
Now, add first version of the dataset into the DVC cache (this is done once for
a dataset).

```dvc
```
Contributor

Let's simplify this whole workflow. Let's just ask users to SSH into the NFS server machine and run `git clone .../project`, then move the data into the project and run `dvc add`, `git commit`, and `git push` (`dvc push` is optional). Everything below can be adjusted a bit.

Contributor Author

@shcheklein I think it will just confuse the user.

Contributor

`cp -r . /project/` is also very confusing. I would say we need to explain the motivation here: we want to avoid copying existing data to a client machine just to take it under DVC control.

I also think the `git clone` workflow is a standard way to collaborate and update different requirements. It's better to do this from the NFS server machine; it'll emphasize that NFS takes care of the data.
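
The simplified flow proposed in this thread could look roughly like the sketch below. It is only an outline under stated assumptions: the function name, the repository URL, and the dataset path are hypothetical, the cloned repository is assumed to be an initialized DVC project, and the commands are meant to be run after SSHing into the NFS server machine.

```shell
# Sketch of the proposed flow, meant to be run on the NFS server machine.
# Both arguments are hypothetical: the team's Git remote and the absolute
# path of the directory where the raw data already lives on the server.
add_dataset_on_server() (
    repo_url=$1
    dataset_dir=$2

    git clone "$repo_url" project &&
    cd project &&
    # Move rather than copy, so the data never leaves the server.
    mv "$dataset_dir" data &&
    # dvc add writes data.dvc and adds the data directory to .gitignore.
    dvc add data &&
    git add data.dvc .gitignore &&
    git commit -m "Add first version of the dataset" &&
    git push
    # `dvc push` is optional here, as noted in the comment above.
)
```

A call would look like `add_dataset_on_server <repo-url> /storage/raw-dataset`, where both values are placeholders. The body runs in a subshell, so the caller's working directory is left unchanged.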

Contributor

@shcheklein shcheklein left a comment

Really good stuff 🎉 Requires a second iteration to clarify/simplify certain things. Let me know if you need some help with it.

@ryokugyu
Contributor Author

@shcheklein please review this.

possible and have a workspace restoration/switching speed as instant as
`git checkout` for your code.

With large data files it is better to set the cache directory to external NFS.
Contributor

minor: we use it's - it's less formal

`git checkout` for your code.

With large data files it is better to set the cache directory to external NFS.
Not only just it will cache the data faster but also version the data. Suppose,
Contributor

cache faster - I'm not sure I understand this

of a complete dataset. With `cache directory` set to `NFS server` you would
avoid copying large files from NFS server to the machine and DVC will manage the
links from the workspace to cache.

Contributor

The paragraph above is good, but it feels repetitive of the first paragraph in the document and has minor problems. What information are you trying to deliver here? Can you summarize it here in the comments, please? Then we'll see how we can improve the text.

@@ -0,0 +1,148 @@
# Shared Storage on NFS

In the modern software development environment, teams are working together on
Contributor

software development -> machine learning

(I even agree that it's software engineering, but it's better to delineate them for now)

@@ -0,0 +1,148 @@
# Shared Storage on NFS
Contributor

I want to rename it ... it's not about NFS only. It's about any network-attached storage. We can do something like:

Share Storage on NAS (NFS)


In the modern software development environment, teams are working together on
same dataset to get the results. It became necessary that data is accessible and
every team member has a same updated dataset. NFS (Network File System) storage
Contributor

NAS (NFS is one common example) is widely ...

Contributor

We can also mention something like: "Here we would like to show you how to set up a shared cache on NFS, but the same idea applies to any other NAS."

team member is using the same cache location.

After configuring NFS on both server and client side. Let's create an export
directory on server side where all data will be stored.
Contributor

It's better to use `:` when the sentence introduces a code block.

From `/mnt/dataset/` you will be able to access `/storage` directory present in
host server from your local machine.

## Configuring Cache location
Contributor

location -> Location
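
For context, the cache settings this section of the guide is about might be applied along the lines of the sketch below. The function name and the cache path are hypothetical; `cache.protected` is the option quoted in the diff under review and may not be available in every DVC release, so check the documentation for the version in use.

```shell
# Sketch: point a DVC project at a cache directory on the NFS mount.
# Run from inside the project. The cache path argument is hypothetical,
# for example a directory under the /mnt/dataset/ mount point.
configure_shared_cache() {
    cache_dir=$1

    # Store the cache on the NFS mount instead of in .dvc/cache.
    dvc cache dir "$cache_dir" &&
    # Link files from the cache into the workspace instead of copying them.
    dvc config cache.type symlink &&
    # Make the linked files read-only so the cache is not corrupted by accident.
    dvc config cache.protected true
}
```

Each team member would run this once per clone with the same cache path, so that every workspace links into the same shared cache.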

Next, you can easily get this appear in your workspace by:

```dvc
$ cd /home/user/project/
```
Contributor

Earlier in the document the project path was `/project`.

information on `.dvc` file format, visit
[here](/doc/user-guide/dvc-file-format).

`data` directory will now be a symbolic link to the NFS storage.
Contributor

It's worth writing something similar to the last paragraph in the introduction to reiterate why links are important. It's also worth showing the output of `ls -a`.

Contributor

@shcheklein shcheklein left a comment

It looks great! We are almost there. Please check some comments. Also, I'll try to come up with an image, similar to what we have for other use cases. Good stuff.

@shcheklein
Contributor

@ryokugyu any updates on this? :) it's almost done as far as I can tell, would be great to get it merged.

@ryokugyu
Contributor Author

> @ryokugyu any updates on this? :) it's almost done as far as I can tell, would be great to get it merged.

@shcheklein will work on it. Sorry for the delay!

@dashohoxha
Contributor

I think that the "Mounted DVC Storage" (which is explained on this interactive example: https://katacoda.com/dvc/courses/examples/mounted-storage) is more general than just NFS and it deprecates this one.

@shcheklein
Contributor

> is more general than just NFS

My concern is that it's very specific because of SSHFS, and it's not emphasized enough that NFS, NAS (whatever else?) is covered.

> deprecates this one

don't think so. Especially the way interactive tutorials are made - they are extremely dry and do not explain motivation very well, do not explain what is happening behind the scene and what commands are doing.

@dashohoxha
Contributor

> don't think so. Especially the way interactive tutorials are made - they are extremely dry and do not explain motivation very well, do not explain what is happening behind the scene and what commands are doing.

In this case it is just an interactive example (not a tutorial) and it is referenced from a User Guide page: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-storage
So, the motivation and high level explanations are supposed to be elaborated on the UG page.

@shcheklein
Contributor

@dashohoxha

> In this case it is just an interactive example (not a tutorial) and it is referenced from a User Guide page: https://dvc-org-pr-784.herokuapp.com/doc/user-guide/data-sharing/mounted-storage
> So, the motivation and high level explanations are supposed to be elaborated on the UG page.

kk. It's just that your initial comment mentioned only the interactive tutorial, and I hadn't had enough time to see the UG changes. Will get back to this one when I have time to read the epic PR :)

@ryokugyu ryokugyu closed this Nov 17, 2019
@shcheklein
Contributor

I think it is still relevant. Unfortunately, no easy way to reopen it now.

Development

Successfully merging this pull request may close these issues.

guide: using NFS as a remote storage

3 participants