diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json
index 6324d1368e..03eadb3d26 100644
--- a/src/Documentation/sidebar.json
+++ b/src/Documentation/sidebar.json
@@ -36,12 +36,14 @@
       "files": [
         "data-and-model-files-versioning.md",
         "share-data-and-model-files.md",
-        "multiple-data-scientists-on-a-single-machine.md"
+        "multiple-data-scientists-on-a-single-machine.md",
+        "shared-storage-on-nfs.md"
       ],
       "labels": {
         "data-and-model-files-versioning.md": "Data & Model Files Versioning",
         "share-data-and-model-files.md": "Share Data & Model Files",
-        "multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine"
+        "multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine",
+        "shared-storage-on-nfs.md": "Shared Storage on NFS"
       }
     },
     {
diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md
new file mode 100644
index 0000000000..cbe34f2152
--- /dev/null
+++ b/static/docs/use-cases/shared-storage-on-nfs.md
@@ -0,0 +1,148 @@
+# Shared Storage on NFS
+
+In modern software development, teams often work together on the same dataset,
+so the data must be accessible and every team member needs the same up-to-date
+copy of it. NFS (Network File System) storage is widely used for storing and
+sharing files over a network. It also allows for better resource utilization,
+such as the ability to store large datasets on a
+single host machine.
+
+With DVC, you can easily set up a shared cache on the NFS server. It lets your
+team store and share data for your projects as effectively as possible, and
+makes restoring or switching workspace versions almost as fast as
+`git checkout` is for your code.
+
+With large data files, it is better to set the cache directory to external NFS
+storage. Not only does this make caching the data faster, it also lets you
+version the data. Suppose we have a dataset with 1 million images. With DVC, we
+can have multiple versions of the dataset without affecting each other's work
+and without duplicating the complete dataset. With the cache directory located
+on the NFS server, you avoid copying large files from the NFS server to your
+machine, and DVC manages the links from the workspace to the cache.
+
+## Preparation
+
+First, configure the NFS server and client machines by following this
+[guide](https://vitux.com/install-nfs-server-and-client-on-ubuntu/).
+
+For this to work on a shared server, after configuring the NFS server and
+client we need to set up a shared cache location for your projects, so that
+every team member uses the same cache location.
+
+With NFS configured on both the server and the client side, let's create an
+export directory on the server side where all data will be stored.
+
+```dvc
+$ mkdir -p /storage
+```
+
+Make sure the directory has proper permissions set up, so that everyone on
+your team can read and write to it and can access cache files written by
+others. The most straightforward way to do that is to make sure that you and
+your colleagues are members of the same group (e.g. 'users') and that the
+shared directory is owned by that group and has the respective permissions.
+
+Let's create a mount point on the client side.
+
+```dvc
+$ mkdir -p /mnt/dataset/
+```
+
+Once the export is mounted at `/mnt/dataset/`, you will be able to access the
+`/storage` directory of the host server from your local machine.
+
+## Configuring cache location
+
+After mounting the shared directory on the client side, and assuming the
+project code is in `/project`, let's initialize a DVC repository.
+
+```dvc
+$ cd /project/
+$ git init
+$ dvc init
+$ git add .dvc .gitignore
+$ git commit . -m "initialize DVC"
+```
+
+With `dvc init`, we initialized a DVC repository. For more information, see
+the [get started guide](/doc/get-started/initialize).
+
+**Tell DVC to use the directory we've set up as an external cache location by
+running:**
+
+```dvc
+$ dvc cache dir /mnt/dataset/storage
+```
+
+`dvc cache dir /path/to/cache/directory` - sets the cache directory location.
+
+```dvc
+$ dvc config cache.type "reflink,symlink,hardlink,copy"
+```
+
+`cache.type "reflink,symlink,hardlink,copy"` - link types that DVC should use
+to link data files from the cache to your workspace. This enables symlinks,
+which avoids copying large files. For more information, see the guide on
+[large dataset optimization](/doc/user-guide/large-dataset-optimization).
+
+```dvc
+$ dvc config cache.protected true
+```
+
+`cache.protected true` - makes the links read-only so that you don't
+accidentally corrupt the data in the workspace. This matters here because we
+are using symlinks between the cache and the local workspace, which are located
+on different file systems.
+
+Also, let Git know about the changes we have made.
+
+```dvc
+$ git add .dvc .gitignore
+$ git commit . -m "DVC cache location updated"
+```
+
+## Add data to DVC cache
+
+Now, add the first version of the dataset to the DVC cache (this is done once
+per dataset).
+
+```dvc
+$ cd /mnt/dataset/
+$ cp -r . /project/
+$ cd /project
+$ mv /mnt/dataset/project_data/ data/
+$ dvc add data
+```
+
+After copying the data, we moved the data located in
+`/mnt/dataset/project_data/` to the `./data` directory. This is only done once
+per dataset.
+
+`dvc add data` puts the files in the `data` directory under DVC control. By
+default, an added file is committed to the DVC cache. With `cache.protected`
+enabled, run `dvc unprotect` before modifying the data. For details, see
+[updating a tracked file](/doc/user-guide/update-tracked-file).
+
+Now, commit `data.dvc` and `.gitignore`, and push them to your Git remote:
+
+```dvc
+$ git add data.dvc .gitignore
+$ git commit . -m "add first version of the dataset"
+$ git tag -a "v1.0" -m "dataset v1.0"
+$ git push origin HEAD
+$ git push origin v1.0
+```
+
+Next, you can easily make the dataset appear in your workspace by running:
+
+```dvc
+$ cd /home/user/project/
+$ git pull
+$ dvc checkout
+```
+
+After `git pull`, you will be able to see a `data.dvc` file. For more
+information on the `.dvc` file format, see the
+[DVC-file format guide](/doc/user-guide/dvc-file-format).
+
+Files in the `data` directory will now be symbolic links into the NFS cache.
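
Reviewer note: after `dvc checkout`, it can be useful to confirm that a workspace path really is a symbolic link into the shared cache rather than a local copy. The following is only a minimal sketch and is not part of DVC or of this PR. The function name `is_linked_into_cache` and the file name in the usage comment are hypothetical, and the sketch assumes the link stores an absolute target path.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): succeeds when the given path is a
# symlink whose target lies under the given cache directory.
# Assumption: the symlink stores an absolute target path.
is_linked_into_cache() {
  path=$1
  cache_dir=${2%/}            # drop one trailing slash, if present
  [ -L "$path" ] || return 1  # not a symlink at all
  target=$(readlink "$path")
  case "$target" in
    "$cache_dir"/*) return 0 ;;  # target is inside the cache directory
    *) return 1 ;;               # symlink points somewhere else
  esac
}

# Usage with an illustrative file name:
# is_linked_into_cache data/image_0001.jpg /mnt/dataset/storage && echo linked
```

A check like this only inspects the link itself, so it works even when the NFS mount is slow, but it does not verify that the cached file's content is intact.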