From 0b4f581490e86f3e26294e24fe09ec1684f32f7d Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 24 Jun 2019 08:22:55 +0530 Subject: [PATCH 01/13] shared-storage-on-nfs added --- src/Documentation/sidebar.json | 6 +- .../docs/use-cases/shared-storage-on-nfs.md | 130 ++++++++++++++++++ 2 files changed, 134 insertions(+), 2 deletions(-) create mode 100644 static/docs/use-cases/shared-storage-on-nfs.md diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index 6324d1368e..03eadb3d26 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -36,12 +36,14 @@ "files": [ "data-and-model-files-versioning.md", "share-data-and-model-files.md", - "multiple-data-scientists-on-a-single-machine.md" + "multiple-data-scientists-on-a-single-machine.md", + "shared-storage-on-nfs.md" ], "labels": { "data-and-model-files-versioning.md": "Data & Model Files Versioning", "share-data-and-model-files.md": "Share Data & Model Files", - "multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine" + "multiple-data-scientists-on-a-single-machine.md": "Shared Development Machine", + "shared-storage-on-nfs.md": "Shared Storage on NFS" } }, { diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md new file mode 100644 index 0000000000..b740f17a70 --- /dev/null +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -0,0 +1,130 @@ +# Shared Storage on NFS + +In the modern software development environment, teams are working together on +same dataset to get the results. It became necessary that data is accessible and +every team member has a same updated dataset. For this example, we will be using +NFS (Network File System) for storing and sharing files on the network. + +With DVC, we need not to copy the dataset on our local machine everytime when +new data is added to dataset. We can set the `cache directory` on NFS server. 
+The cached data will be present in the NFS server, which in turn will be fast to +access and process requests faster. + +With large data files it is better to set the cache directory to NFS. Not only +just it will cache the data faster but also version the data. Suppose, we have a +dataset with 1 million images. With DVC, we can have multiple versions of a +dataset without affecting each other work and without creating duplicates of a +complete dataset. With `cache directory` set to `NFS server` you would avoid +copying large files from NFS server to the machine and DVC will manage the links +from the workspace to cache. + +First configure NFS server and client machine, following this +[link](https://vitux.com/install-nfs-server-and-client-on-ubuntu/). + +# Real data + +With DVC, we can easily setup a shared cache storage on the NFS server that will +allow your team to share and store data for your projects as effectively as +possible and have a workspace restoration/switching speed as instant +as`git checkout` for your code. + +### Preparation + +In order to make it work on a shared server, we need to setup a shared cache +location for your projects, so that every team member is using the same cache +location. + +After configuring NFS on both server and client side. Let's create an export +directory on server side where all data will be stored. + +```dvc +mkdir -p /storage +``` + +You will have to make sure that the directory has proper permissions setup,so +that every one on your team can read and write to it and can access cache files +written by others. The most straightforward way to do that is to make sure that +you and your colleagues are members of the same group (e.g. 'users') and that +your shared directory is owned by that group and has respective permissions. + +Let's create a mount point of client side. + +```dvc +mkdir -p /mnt/dataset/ +``` + +### Configure Cache + +After mounting the shared directory on client side. 
Assuming project code is in +`/home/user/project1`. Let's initialize a `dvc repo`. + +```dvc +cd /home/user/project1/ +dvc init +git add .dvc .gitignore +git commit . -m "initialize DVC" +``` + +With `dvc init`, we initialized a DVC repository. DVC will start tracking all +the changes. + +Tell DVC to use the directory we've set up as an external cache location by +running: + +```dvc +dvc cache dir /mnt/dataset/storage +dvc config cache.type "reflink,symlink,hardlink,copy" +dvc config cache.protected true +git add .dvc .gitignore +git commit . -m "DVC cache location updated" +``` + +By default cache is present in the `.dvc/cache` location. `dvc cache dir` +changes the location of cache directory to `/mnt/dataset/storage` + +`cache.type "reflink,symlink,hardlink,copy"` - enables symlinks to avoid copying +large files. + +`cache.protected true` - to make links `read only` so that we you don't corrupt +data accidentally + +Also, let git know about the changes we have done. + +### Add data to DVC cache + +Now, add first version of the dataset into the DVC cache (this is done once for +a dataset). + +```dvc +cd /mnt/dataset/ +cp -r . /home/user/project1/ +cd /home/user/project1 +mv /mnt/dataset/project1_data/ data/ +dvc add data +``` + +`dvc add data` will take files in `data` directory under DVC control. By default +an added file is committed to the DVC cache. + +Commit changes to `.dvc/config` and push them to your git remote: + +```dvc +git add data.dvc .gitignore +git commit . -m "add first version of the dataset" +git tag -a "v1.0" -m "dataset v1.0" +git push origin HEAD +git push origin v1.0 +``` + +Next, you can easily get this appear in your workspace by: + +```dvc +cd /home/user/project1/ +git pull +dvc checkout +``` + +After `git pull`, you will be able to see a `data.dvc` file. To see more +information on `.dvc` file, visit [here](/doc/user-guide/dvc-file-format). + +`data` directory will now be a symbolic link to the NFS storage. 
From d072afc5cf95399104ea4974fc9a9f20e41efbfb Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Tue, 25 Jun 2019 21:25:37 +0530 Subject: [PATCH 02/13] shared-storage-on-nfs --- .../docs/use-cases/shared-storage-on-nfs.md | 130 ++++++++++-------- 1 file changed, 72 insertions(+), 58 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index b740f17a70..51684dee0d 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -3,42 +3,40 @@ In the modern software development environment, teams are working together on same dataset to get the results. It became necessary that data is accessible and every team member has a same updated dataset. For this example, we will be using -NFS (Network File System) for storing and sharing files on the network. +NFS (Network File System) for storing and sharing files on the network. This +allows you to have better resource utilization such as ability to store large +disk consuming dataset on a single host machine. + +For optimizing the performance, we can set the `cache directory` on NFS server +by configuring the DVC repository from making changes in the DVC config file +which is present in `.dvc/config` location. With DVC, you can easily setup a +shared cache storage on the NFS server that will allow your team to share and +store data for your projects effectively as possible and have a workspace +restoration/switching speed as instant as `git checkout` for your code. + +With large data files it is better to set the cache directory to external NFS. +Not only just it will cache the data faster but also version the data. Suppose, +we have a dataset with 1 million images. With DVC, we can have multiple versions +of a dataset without affecting each other work and without creating duplicates +of a complete dataset. 
With `cache directory` set to `NFS server` you would +avoid copying large files from NFS server to the machine and DVC will manage the +links from the workspace to cache. For more information, visit +[Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning). -With DVC, we need not to copy the dataset on our local machine everytime when -new data is added to dataset. We can set the `cache directory` on NFS server. -The cached data will be present in the NFS server, which in turn will be fast to -access and process requests faster. - -With large data files it is better to set the cache directory to NFS. Not only -just it will cache the data faster but also version the data. Suppose, we have a -dataset with 1 million images. With DVC, we can have multiple versions of a -dataset without affecting each other work and without creating duplicates of a -complete dataset. With `cache directory` set to `NFS server` you would avoid -copying large files from NFS server to the machine and DVC will manage the links -from the workspace to cache. +### Preparation First configure NFS server and client machine, following this [link](https://vitux.com/install-nfs-server-and-client-on-ubuntu/). -# Real data - -With DVC, we can easily setup a shared cache storage on the NFS server that will -allow your team to share and store data for your projects as effectively as -possible and have a workspace restoration/switching speed as instant -as`git checkout` for your code. - -### Preparation - -In order to make it work on a shared server, we need to setup a shared cache -location for your projects, so that every team member is using the same cache -location. +In order to make it work on a shared server, after configuring NFS server and +client we need to setup a shared cache location for your projects, so that every +team member is using the same cache location. After configuring NFS on both server and client side. 
Let's create an export directory on server side where all data will be stored. ```dvc -mkdir -p /storage +$ mkdir -p /storage ``` You will have to make sure that the directory has proper permissions setup,so @@ -50,19 +48,23 @@ your shared directory is owned by that group and has respective permissions. Let's create a mount point of client side. ```dvc -mkdir -p /mnt/dataset/ +$ mkdir -p /mnt/dataset/ ``` -### Configure Cache +From `/mnt/dataset/` you will be able to access `/storage` directory present in +host server from your local machine. + +### Configuring Cache location -After mounting the shared directory on client side. Assuming project code is in -`/home/user/project1`. Let's initialize a `dvc repo`. +After mounting the shared directory on client side. Assuming project code is +present in `/home/user/project1`. Let's initialize a `dvc repo`. ```dvc -cd /home/user/project1/ -dvc init -git add .dvc .gitignore -git commit . -m "initialize DVC" +$ cd /home/user/project1/ +$ git init +$ dvc init +$ git add .dvc .gitignore +$ git commit . -m "initialize DVC" ``` With `dvc init`, we initialized a DVC repository. DVC will start tracking all @@ -72,59 +74,71 @@ Tell DVC to use the directory we've set up as an external cache location by running: ```dvc -dvc cache dir /mnt/dataset/storage -dvc config cache.type "reflink,symlink,hardlink,copy" -dvc config cache.protected true -git add .dvc .gitignore -git commit . -m "DVC cache location updated" +$ dvc config cache.dir /mnt/dataset/storage +$ dvc config cache.type "reflink,symlink,hardlink,copy" +$ dvc config cache.protected true +$ git add .dvc .gitignore +$ git commit . -m "DVC cache location updated" ``` By default cache is present in the `.dvc/cache` location. `dvc cache dir` changes the location of cache directory to `/mnt/dataset/storage` -`cache.type "reflink,symlink,hardlink,copy"` - enables symlinks to avoid copying -large files. +`config cache.dir /path/to/cache/directory` - sets cache directory location. 
+Alternatively, we can also use `dvc cache dir /path/to/cache/directory`. + +`cache.type "reflink,symlink,hardlink,copy"` - link type that DVC should use to +link data files from cache to your workspace. It enables symlinks to avoid +copying large files. `cache.protected true` - to make links `read only` so that we you don't corrupt -data accidentally +data accidentally present in the workspace. + +For more information on `config` options, visit +[here](https://dvc.org/doc/commands-reference/config#configuration-sections) Also, let git know about the changes we have done. -### Add data to DVC cache +#### Add data to DVC cache Now, add first version of the dataset into the DVC cache (this is done once for a dataset). ```dvc -cd /mnt/dataset/ -cp -r . /home/user/project1/ -cd /home/user/project1 -mv /mnt/dataset/project1_data/ data/ -dvc add data +$ cd /mnt/dataset/ +$ cp -r . /home/user/project1/ +$ cd /home/user/project1 +$ mv /mnt/dataset/project1_data/ data/ +$ dvc add data ``` +After copying the data, we have moved the data that is present in the +`/mnt/dataset/project1_data/`vto `./data` directory. This is only done once for +a dataset. + `dvc add data` will take files in `data` directory under DVC control. By default an added file is committed to the DVC cache. -Commit changes to `.dvc/config` and push them to your git remote: +Now, commit changes to `.dvc/config` and push them to your git remote: ```dvc -git add data.dvc .gitignore -git commit . -m "add first version of the dataset" -git tag -a "v1.0" -m "dataset v1.0" -git push origin HEAD -git push origin v1.0 +$ git add data.dvc .gitignore +$ git commit . 
-m "add first version of the dataset" +$ git tag -a "v1.0" -m "dataset v1.0" +$ git push origin HEAD +$ git push origin v1.0 ``` Next, you can easily get this appear in your workspace by: ```dvc -cd /home/user/project1/ -git pull -dvc checkout +$ cd /home/user/project1/ +$ git pull +$ dvc checkout ``` After `git pull`, you will be able to see a `data.dvc` file. To see more -information on `.dvc` file, visit [here](/doc/user-guide/dvc-file-format). +information on `.dvc` file format, visit +[here](/doc/user-guide/dvc-file-format). `data` directory will now be a symbolic link to the NFS storage. From 234172c23964c0d1dfd81d3daa391832ec9b165b Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Thu, 27 Jun 2019 08:44:22 +0530 Subject: [PATCH 03/13] removed # --- static/docs/use-cases/shared-storage-on-nfs.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index 51684dee0d..669214cda2 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -23,7 +23,7 @@ avoid copying large files from NFS server to the machine and DVC will manage the links from the workspace to cache. For more information, visit [Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning). -### Preparation +## Preparation First configure NFS server and client machine, following this [link](https://vitux.com/install-nfs-server-and-client-on-ubuntu/). @@ -54,7 +54,7 @@ $ mkdir -p /mnt/dataset/ From `/mnt/dataset/` you will be able to access `/storage` directory present in host server from your local machine. -### Configuring Cache location +## Configuring Cache location After mounting the shared directory on client side. Assuming project code is present in `/home/user/project1`. Let's initialize a `dvc repo`. 
@@ -99,7 +99,7 @@ For more information on `config` options, visit Also, let git know about the changes we have done. -#### Add data to DVC cache +## Add data to DVC cache Now, add first version of the dataset into the DVC cache (this is done once for a dataset). From f64f3e8cd458801c82967d0eae4de3e787584fef Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 11:32:30 +0530 Subject: [PATCH 04/13] changes changes according to the suggestions. working on other too. --- .../docs/use-cases/shared-storage-on-nfs.md | 52 ++++++++++--------- 1 file changed, 28 insertions(+), 24 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index 669214cda2..ffef176dc7 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -5,14 +5,12 @@ same dataset to get the results. It became necessary that data is accessible and every team member has a same updated dataset. For this example, we will be using NFS (Network File System) for storing and sharing files on the network. This allows you to have better resource utilization such as ability to store large -disk consuming dataset on a single host machine. +datasets on a single host machine. -For optimizing the performance, we can set the `cache directory` on NFS server -by configuring the DVC repository from making changes in the DVC config file -which is present in `.dvc/config` location. With DVC, you can easily setup a -shared cache storage on the NFS server that will allow your team to share and -store data for your projects effectively as possible and have a workspace -restoration/switching speed as instant as `git checkout` for your code. 
+With DVC, you can easily setup a shared cache storage on the NFS server that +will allow your team to share and store data for your projects effectively as +possible and have a workspace restoration/switching speed as instant as +`git checkout` for your code. With large data files it is better to set the cache directory to external NFS. Not only just it will cache the data faster but also version the data. Suppose, @@ -39,7 +37,7 @@ directory on server side where all data will be stored. $ mkdir -p /storage ``` -You will have to make sure that the directory has proper permissions setup,so +You will have to make sure that the directory has proper permissions setup, so that every one on your team can read and write to it and can access cache files written by others. The most straightforward way to do that is to make sure that you and your colleagues are members of the same group (e.g. 'users') and that @@ -57,48 +55,54 @@ host server from your local machine. ## Configuring Cache location After mounting the shared directory on client side. Assuming project code is -present in `/home/user/project1`. Let's initialize a `dvc repo`. +present in `/project1`. Let's initialize a `dvc repo`. ```dvc -$ cd /home/user/project1/ +$ cd /project1/ $ git init $ dvc init $ git add .dvc .gitignore $ git commit . -m "initialize DVC" ``` -With `dvc init`, we initialized a DVC repository. DVC will start tracking all -the changes. +With `dvc init`, we initialized a DVC repository. For more information, visit +[here](/doc/get-started/initialize). Tell DVC to use the directory we've set up as an external cache location by running: ```dvc $ dvc config cache.dir /mnt/dataset/storage -$ dvc config cache.type "reflink,symlink,hardlink,copy" -$ dvc config cache.protected true -$ git add .dvc .gitignore -$ git commit . -m "DVC cache location updated" ``` -By default cache is present in the `.dvc/cache` location. 
`dvc cache dir` -changes the location of cache directory to `/mnt/dataset/storage` - `config cache.dir /path/to/cache/directory` - sets cache directory location. Alternatively, we can also use `dvc cache dir /path/to/cache/directory`. +```dvc +$ dvc config cache.type "reflink,symlink,hardlink,copy" +``` + `cache.type "reflink,symlink,hardlink,copy"` - link type that DVC should use to link data files from cache to your workspace. It enables symlinks to avoid copying large files. +```dvc +$ dvc config cache.protected true +``` + `cache.protected true` - to make links `read only` so that we you don't corrupt data accidentally present in the workspace. +Also, let git know about the changes we have done. + +```dvc +$ git add .dvc .gitignore +$ git commit . -m "DVC cache location updated" +``` + For more information on `config` options, visit [here](https://dvc.org/doc/commands-reference/config#configuration-sections) -Also, let git know about the changes we have done. - ## Add data to DVC cache Now, add first version of the dataset into the DVC cache (this is done once for @@ -106,14 +110,14 @@ a dataset). ```dvc $ cd /mnt/dataset/ -$ cp -r . /home/user/project1/ -$ cd /home/user/project1 +$ cp -r . /project1/ +$ cd /project1 $ mv /mnt/dataset/project1_data/ data/ $ dvc add data ``` After copying the data, we have moved the data that is present in the -`/mnt/dataset/project1_data/`vto `./data` directory. This is only done once for +`/mnt/dataset/project1_data/` to `./data` directory. This is only done once for a dataset. `dvc add data` will take files in `data` directory under DVC control. 
By default From fa047e8b333e42df7df9d1d32bb2ca5ee2c546fe Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 11:37:12 +0530 Subject: [PATCH 05/13] git -> Git --- static/docs/use-cases/shared-storage-on-nfs.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index ffef176dc7..a0c52a7237 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -93,7 +93,7 @@ $ dvc config cache.protected true `cache.protected true` - to make links `read only` so that we you don't corrupt data accidentally present in the workspace. -Also, let git know about the changes we have done. +Also, let Git know about the changes we have done. ```dvc $ git add .dvc .gitignore From 740258865d501a2802fc046984fbb0da9904f097 Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 11:59:09 +0530 Subject: [PATCH 06/13] NFS-storage intro updated --- .../docs/use-cases/shared-storage-on-nfs.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index a0c52a7237..d66cdb7c9e 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -2,10 +2,10 @@ In the modern software development environment, teams are working together on same dataset to get the results. It became necessary that data is accessible and -every team member has a same updated dataset. For this example, we will be using -NFS (Network File System) for storing and sharing files on the network. This -allows you to have better resource utilization such as ability to store large -datasets on a single host machine. +every team member has a same updated dataset. NFS (Network File System) storage +is widely used for storing and sharing files on the network. 
This allows you to +have better resource utilization such as ability to store large datasets on a +single host machine. With DVC, you can easily setup a shared cache storage on the NFS server that will allow your team to share and store data for your projects effectively as @@ -72,11 +72,10 @@ Tell DVC to use the directory we've set up as an external cache location by running: ```dvc -$ dvc config cache.dir /mnt/dataset/storage +$ dvc cache dir /mnt/dataset/storage ``` -`config cache.dir /path/to/cache/directory` - sets cache directory location. -Alternatively, we can also use `dvc cache dir /path/to/cache/directory`. +`dvc cache dir /path/to/cache/directory` - sets cache directory location. ```dvc $ dvc config cache.type "reflink,symlink,hardlink,copy" @@ -93,6 +92,9 @@ $ dvc config cache.protected true `cache.protected true` - to make links `read only` so that we you don't corrupt data accidentally present in the workspace. +For more information on `config` options, visit +[here](https://dvc.org/doc/commands-reference/config#configuration-sections). + Also, let Git know about the changes we have done. ```dvc @@ -100,9 +102,6 @@ $ git add .dvc .gitignore $ git commit . 
-m "DVC cache location updated" ``` -For more information on `config` options, visit -[here](https://dvc.org/doc/commands-reference/config#configuration-sections) - ## Add data to DVC cache Now, add first version of the dataset into the DVC cache (this is done once for From 2fbff29c93d3354e3c98b3d9c46e2a893e344f5a Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 12:04:59 +0530 Subject: [PATCH 07/13] Data and Model Files Versioning link removed --- static/docs/use-cases/shared-storage-on-nfs.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index d66cdb7c9e..37a1300174 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -18,8 +18,7 @@ we have a dataset with 1 million images. With DVC, we can have multiple versions of a dataset without affecting each other work and without creating duplicates of a complete dataset. With `cache directory` set to `NFS server` you would avoid copying large files from NFS server to the machine and DVC will manage the -links from the workspace to cache. For more information, visit -[Data and Model Files Versioning](/doc/use-cases/data-and-model-files-versioning). +links from the workspace to cache. ## Preparation From c44563d7f836a1e4faf793197ce06cb19d0419a7 Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 12:07:14 +0530 Subject: [PATCH 08/13] project1 -> project --- static/docs/use-cases/shared-storage-on-nfs.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index 37a1300174..e7a5075e52 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -54,10 +54,10 @@ host server from your local machine. 
## Configuring Cache location After mounting the shared directory on client side. Assuming project code is -present in `/project1`. Let's initialize a `dvc repo`. +present in `/project`. Let's initialize a `dvc repo`. ```dvc -$ cd /project1/ +$ cd /project/ $ git init $ dvc init $ git add .dvc .gitignore @@ -108,15 +108,15 @@ a dataset). ```dvc $ cd /mnt/dataset/ -$ cp -r . /project1/ +$ cp -r . /project/ $ cd /project1 -$ mv /mnt/dataset/project1_data/ data/ +$ mv /mnt/dataset/project_data/ data/ $ dvc add data ``` After copying the data, we have moved the data that is present in the -`/mnt/dataset/project1_data/` to `./data` directory. This is only done once for -a dataset. +`/mnt/dataset/project_data/` to `./data` directory. This is only done once for a +dataset. `dvc add data` will take files in `data` directory under DVC control. By default an added file is committed to the DVC cache. From e8ed1036ee7d682969b28f51043ad71a016c8bb8 Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 12:10:13 +0530 Subject: [PATCH 09/13] formatted a sentence to bold --- static/docs/use-cases/shared-storage-on-nfs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index e7a5075e52..ac40a5b4c0 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -67,8 +67,8 @@ $ git commit . -m "initialize DVC" With `dvc init`, we initialized a DVC repository. For more information, visit [here](/doc/get-started/initialize). 
-Tell DVC to use the directory we've set up as an external cache location by -running: +**Tell DVC to use the directory we've set up as an external cache location by +running:** ```dvc $ dvc cache dir /mnt/dataset/storage From 27f857219080569a1c91cf1cd5040ad79ca678c3 Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 8 Jul 2019 12:15:27 +0530 Subject: [PATCH 10/13] large-dataset-optimization link introduced --- static/docs/use-cases/shared-storage-on-nfs.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index ac40a5b4c0..3d97b3c83e 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -82,7 +82,8 @@ $ dvc config cache.type "reflink,symlink,hardlink,copy" `cache.type "reflink,symlink,hardlink,copy"` - link type that DVC should use to link data files from cache to your workspace. It enables symlinks to avoid -copying large files. +copying large files. For more information, visit +[here](/doc/user-guide/large-dataset-optimization). ```dvc $ dvc config cache.protected true @@ -91,9 +92,6 @@ $ dvc config cache.protected true `cache.protected true` - to make links `read only` so that we you don't corrupt data accidentally present in the workspace. -For more information on `config` options, visit -[here](https://dvc.org/doc/commands-reference/config#configuration-sections). - Also, let Git know about the changes we have done. 
```dvc From c99cc0484e1f31e5085038459539a709148fa2fc Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 22 Jul 2019 13:45:33 +0530 Subject: [PATCH 11/13] project1 -> project --- static/docs/use-cases/shared-storage-on-nfs.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index 3d97b3c83e..cf9bb42ddf 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -107,7 +107,7 @@ a dataset). ```dvc $ cd /mnt/dataset/ $ cp -r . /project/ -$ cd /project1 +$ cd /project $ mv /mnt/dataset/project_data/ data/ $ dvc add data ``` @@ -132,7 +132,7 @@ $ git push origin v1.0 Next, you can easily get this appear in your workspace by: ```dvc -$ cd /home/user/project1/ +$ cd /home/user/project/ $ git pull $ dvc checkout ``` From 1b24399c8a60c27e4eee8e4ed17524596ef21a35 Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Mon, 22 Jul 2019 13:59:08 +0530 Subject: [PATCH 12/13] why cache.protected true info added --- static/docs/use-cases/shared-storage-on-nfs.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index cf9bb42ddf..0ea6af11aa 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -90,7 +90,9 @@ $ dvc config cache.protected true ``` `cache.protected true` - to make links `read only` so that we you don't corrupt -data accidentally present in the workspace. +data accidentally present in the workspace. This is needed because we are using +`symlinks` between the cache and the local workspace, which are located on +different file systems. Also, let Git know about the changes we have done. 
From 31c5d424c6530bb793af69c2af578d2b8a374d02 Mon Sep 17 00:00:00 2001 From: Himanshu Nailwal Date: Wed, 31 Jul 2019 23:15:12 +0530 Subject: [PATCH 13/13] dvc unprotect info added --- static/docs/use-cases/shared-storage-on-nfs.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/static/docs/use-cases/shared-storage-on-nfs.md b/static/docs/use-cases/shared-storage-on-nfs.md index 0ea6af11aa..cbe34f2152 100644 --- a/static/docs/use-cases/shared-storage-on-nfs.md +++ b/static/docs/use-cases/shared-storage-on-nfs.md @@ -119,7 +119,9 @@ After copying the data, we have moved the data that is present in the dataset. `dvc add data` will take files in `data` directory under DVC control. By default -an added file is committed to the DVC cache. +an added file is committed to the DVC cache. The added data is read-only, so +run `dvc unprotect` before modifying it. For more information, visit +[here](/doc/user-guide/update-tracked-file). Now, commit changes to `.dvc/config` and push them to your git remote: