From e66c9c4541bf51a9a97bb7ff0348a790aab555dd Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Fri, 16 Aug 2019 17:08:47 +0200 Subject: [PATCH 01/11] Use Case: Huge data on an extarnal local drive --- .../huge-data-on-an-external-drive.md | 138 ++++++++++++++++++ 1 file changed, 138 insertions(+) create mode 100644 static/docs/use-cases/huge-data-on-an-external-drive.md diff --git a/static/docs/use-cases/huge-data-on-an-external-drive.md b/static/docs/use-cases/huge-data-on-an-external-drive.md new file mode 100644 index 0000000000..04d25bf0b1 --- /dev/null +++ b/static/docs/use-cases/huge-data-on-an-external-drive.md @@ -0,0 +1,138 @@ +# Huge Data On An External Local Drive + +Sometimes the data may be huge and they are stored on an external local drive. +By "huge" we mean that they won't fit on our home directory, and even if they +did, it would certainly take a long time to copy them back an forth from the +external drive to our home directory. For example let's say that the data are +stored on an external HDD drive of size 16TB, which is mounted on `/mnt/data/`, +while the disk of our home directory has a size of only 320GB. + +In this case we would like to process the data where they are (on the external +drive), to save the results there, and certainly to store the cached files on +the external drive too. + +The most easy way to do this would be to locate the workspace on +the external drive as well, which could be done like this: + +```dvc +$ sudo su +# cd /mnt/data/ +# git init +# dvc init +``` + +But in case this is not possible (or is not preferable), we can easily setup the +workspace on our home directory, while all the data files and their +caches keep staying on the external drive. DVC will still be able to track them +properly. + +### Make the data directory accessible + +For this to work, first you have to make sure that you can read and write the +data directory `/mnt/data/`. 
The most straightforward way to do this is by +setting proper ownership and permissions to it, like this: + +```dvc +$ sudo chown : -R /mnt/data/ +$ chmod u+rw -R /mnt/data/ +``` + +### Start a DVC project and setup a local external cache + +An _external_ cache is called so because it resides outside of your +workspace directory. We also call it _local_ because it is located +within our filesystem (as opposed to being located somewhere on the internet, in +which case it is called _remote_). Let's create a directory for it: + +```dvc +$ mkdir -p /mnt/data/dvc-cache +``` + +Now you can initialize a project on your home directory and configure it to use +the external cache directory: + +```dvc +$ cd ~/project/ +$ git init +$ dvc init + +$ dvc config cache.dir /mnt/data/dvc-cache +$ rm -rf .dvc/cache/ + +$ git add .dvc/config +$ git commit -m 'DVC with external cache dir' +``` + +If you check the config file you will see something like this: + +```dvc +$ cat .dvc/config +[cache] +dir = /mnt/data/dvc-cache +``` + +### Example of tracking external dependencies and outputs + +Now, when you refer to the data files and directories, you have to use their +absolute path. The DVC-files will be created on the project +directory, and you can track their modifications with `git` as usual. + +For example let's say that the raw data are on `/mnt/data/raw/` you are cleaning +them up. You could do it like this: + +```dvc +$ dvc add /mnt/data/raw + +$ dvc run -f clean.dvc \ + -d /mnt/data/raw \ + -o /mnt/data/clean \ + ./cleanup.py /mnt/data/raw /mnt/data/clean +``` + +If you check the contents of the files `raw.dvc` and `clean.dvc` you will notice +that their `path:` field refers to the external directories: + +```dvc +$ cat raw.dvc +md5: 9cbbacd47133debf91dcb41891c64730 +wdir: . 
+outs: +- md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir + path: /mnt/data/raw + cache: true + metric: false + persist: false + +$ cat clean.dvc +md5: 2b842ed58b1792dde6df27e3d0f73430 +cmd: cp -a /mnt/data/raw /mnt/data/clean +wdir: . +deps: +- md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir + path: /mnt/data/raw +outs: +- md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir + path: /mnt/data/clean + cache: true + metric: false + persist: false +``` + +You can also check and verify that indeed all the data and cache files are +stored on the external drive: + +```dvc +$ ls /mnt/data/ +clean dvc-cache raw + +$ ls /mnt/data/dvc-cache +. . . + +``` + +Now you can add and commit the DVC-files to git: + +```dvc +$ git add raw.dvc clean.dvc +$ git commit -m "cleanup raw data" +``` From 30e323dadc9357a86042a48e67c50a849cab5b80 Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Fri, 16 Aug 2019 17:58:15 +0200 Subject: [PATCH 02/11] Add a note --- .../huge-data-on-an-external-drive.md | 28 +++++++++++++++---- 1 file changed, 23 insertions(+), 5 deletions(-) diff --git a/static/docs/use-cases/huge-data-on-an-external-drive.md b/static/docs/use-cases/huge-data-on-an-external-drive.md index 04d25bf0b1..63349dac36 100644 --- a/static/docs/use-cases/huge-data-on-an-external-drive.md +++ b/static/docs/use-cases/huge-data-on-an-external-drive.md @@ -26,7 +26,7 @@ But in case this is not possible (or is not preferable), we can easily setup the caches keep staying on the external drive. DVC will still be able to track them properly. -### Make the data directory accessible +## Make the data directory accessible For this to work, first you have to make sure that you can read and write the data directory `/mnt/data/`. 
The most straightforward way to do this is by @@ -37,7 +37,7 @@ $ sudo chown : -R /mnt/data/ $ chmod u+rw -R /mnt/data/ ``` -### Start a DVC project and setup a local external cache +## Start a DVC project and setup a local external cache An _external_ cache is called so because it resides outside of your workspace directory. We also call it _local_ because it is located @@ -71,14 +71,14 @@ $ cat .dvc/config dir = /mnt/data/dvc-cache ``` -### Example of tracking external dependencies and outputs +## Example of tracking external dependencies and outputs Now, when you refer to the data files and directories, you have to use their absolute path. The DVC-files will be created on the project directory, and you can track their modifications with `git` as usual. -For example let's say that the raw data are on `/mnt/data/raw/` you are cleaning -them up. You could do it like this: +For example let's say that the raw data are on `/mnt/data/raw/` and you are +cleaning them up. You could do it like this: ```dvc $ dvc add /mnt/data/raw @@ -89,6 +89,24 @@ $ dvc run -f clean.dvc \ ./cleanup.py /mnt/data/raw /mnt/data/clean ``` +
+ +### Using an environment variable for the data path + +In a real life situation probabaly you would declare an environment variable +`DATA_PATH=/mnt/data` and use it to shorten the command options, like this: + +```dvc +$ dvc add $DATA_PATH/raw + +$ dvc run -f clean.dvc \ + -d $DATA_PATH/raw \ + -o $DATA_PATH/clean \ + ./cleanup.py $DATA_PATH/raw $DATA_PATH/clean +``` + +
+ If you check the contents of the files `raw.dvc` and `clean.dvc` you will notice that their `path:` field refers to the external directories: From 444058a6f3011888e836e06e14265078d86f1fc1 Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Sat, 17 Aug 2019 12:05:00 +0200 Subject: [PATCH 03/11] Fixing and extending --- .../huge-data-on-an-external-drive.md | 127 ++++++++++++------ 1 file changed, 89 insertions(+), 38 deletions(-) diff --git a/static/docs/use-cases/huge-data-on-an-external-drive.md b/static/docs/use-cases/huge-data-on-an-external-drive.md index 63349dac36..7f05dec3a3 100644 --- a/static/docs/use-cases/huge-data-on-an-external-drive.md +++ b/static/docs/use-cases/huge-data-on-an-external-drive.md @@ -1,18 +1,20 @@ -# Huge Data On An External Local Drive +# Managing Data Storage On An External Hard Drive -Sometimes the data may be huge and they are stored on an external local drive. +Sometimes the data may be huge and stored on an +[external hard drive](https://whatis.techtarget.com/definition/external-hard-drive). By "huge" we mean that they won't fit on our home directory, and even if they did, it would certainly take a long time to copy them back an forth from the external drive to our home directory. For example let's say that the data are -stored on an external HDD drive of size 16TB, which is mounted on `/mnt/data/`, -while the disk of our home directory has a size of only 320GB. +stored on an external hard drive of size 16TB, while the hard drive of our home +directory has a size of only 320GB. -In this case we would like to process the data where they are (on the external -drive), to save the results there, and certainly to store the cached files on -the external drive too. +In this case we would like to process the data where they are located (on the +external drive). We also would like to save the results there, and certainly to +store the cached files there as well. 
The most easy way to do this would be to locate the workspace on -the external drive as well, which could be done like this: +the external drive itself. If we assume that the external drive is mounted on +`/mnt/data/`, then it could be done like this: ```dvc $ sudo su @@ -22,9 +24,8 @@ $ sudo su ``` But in case this is not possible (or is not preferable), we can easily setup the -workspace on our home directory, while all the data files and their -caches keep staying on the external drive. DVC will still be able to track them -properly. +workspace in our home directory, while all the data files and their caches keep +staying on the external drive. DVC will still be able to track them properly. ## Make the data directory accessible @@ -37,19 +38,17 @@ $ sudo chown : -R /mnt/data/ $ chmod u+rw -R /mnt/data/ ``` -## Start a DVC project and setup a local external cache +## Start a DVC project and setup an external cache -An _external_ cache is called so because it resides outside of your -workspace directory. We also call it _local_ because it is located -within our filesystem (as opposed to being located somewhere on the internet, in -which case it is called _remote_). Let's create a directory for it: +An _external_ cache is called so because it resides outside of the +workspace directory. Let's create a directory for it on `/mnt/data/`: ```dvc $ mkdir -p /mnt/data/dvc-cache ``` -Now you can initialize a project on your home directory and configure it to use -the external cache directory: +Now you can initialize a project on your home directory and +configure it to use the external cache directory: ```dvc $ cd ~/project/ @@ -60,10 +59,26 @@ $ dvc config cache.dir /mnt/data/dvc-cache $ rm -rf .dvc/cache/ $ git add .dvc/config -$ git commit -m 'DVC with external cache dir' +$ git commit -m 'Initialize DVC with external cache' ``` -If you check the config file you will see something like this: +
+ +### Transfer the content of the cache to the external directory + +In this example we are just removing the default cache directory `.dvc/cache/` +because we just initialized the project and we know that it is empty (there's +nothing stored in it). If we had an existing project, we could preserve the +content of the cache by moving it to the new directory: + +```dvc +$ mv .dvc/cache/* /mnt/data/dvc-cache/ +$ rm -rf .dvc/cache/ +``` + +
+ +If you check the config file you should see something like this: ```dvc $ cat .dvc/config @@ -71,13 +86,13 @@ $ cat .dvc/config dir = /mnt/data/dvc-cache ``` -## Example of tracking external dependencies and outputs +## Tracking external dependencies and outputs Now, when you refer to the data files and directories, you have to use their -absolute path. The DVC-files will be created on the project +**absolute path**. The DVC-files will be created on the project directory, and you can track their modifications with `git` as usual. -For example let's say that the raw data are on `/mnt/data/raw/` and you are +For example let's say that the raw data file are on `/mnt/data/raw/` and you are cleaning them up. You could do it like this: ```dvc @@ -93,7 +108,7 @@ $ dvc run -f clean.dvc \ ### Using an environment variable for the data path -In a real life situation probabaly you would declare an environment variable +In a real life situation probably you would declare an environment variable `DATA_PATH=/mnt/data` and use it to shorten the command options, like this: ```dvc @@ -112,28 +127,36 @@ that their `path:` field refers to the external directories: ```dvc $ cat raw.dvc +``` + +```yaml md5: 9cbbacd47133debf91dcb41891c64730 wdir: . outs: -- md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/data/raw - cache: true - metric: false - persist: false + - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir + path: /mnt/data/raw + cache: true + metric: false + persist: false +``` +```dvc $ cat clean.dvc +``` + +```yaml md5: 2b842ed58b1792dde6df27e3d0f73430 cmd: cp -a /mnt/data/raw /mnt/data/clean wdir: . 
deps: -- md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/data/raw + - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir + path: /mnt/data/raw outs: -- md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/data/clean - cache: true - metric: false - persist: false + - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir + path: /mnt/data/clean + cache: true + metric: false + persist: false ``` You can also check and verify that indeed all the data and cache files are @@ -144,8 +167,7 @@ $ ls /mnt/data/ clean dvc-cache raw $ ls /mnt/data/dvc-cache -. . . - +... ``` Now you can add and commit the DVC-files to git: @@ -154,3 +176,32 @@ Now you can add and commit the DVC-files to git: $ git add raw.dvc clean.dvc $ git commit -m "cleanup raw data" ``` + +
+ +### Optimizing the data management + +Since we are talking about large data, it is worth spending some time +understanding +[how DVC can optimize data management](/doc/user-guide/large-dataset-optimization), +so that it does not make unnecessary copies of large data. + +In short, if your external drive is formatted with XFS, Btrfs, ZFS, or any other +file system that supports reflinks, DVC will automatically use the +most efficient way of handling large datasets, and there is no further +configuration that needs to be done. + +If _reflinks_ are not available, then you should consider setting the cache type +to _symlink_ or _hardlink_, like so: + +```dvc +$ dvc config cache.type "reflink,symlink,hardlink,copy" +$ dvc config cache.protected true +``` + +However, this implies that for data files added to the project with +`dvc add`, you may need to run `dvc unprotect` on those files before +modifying them. For more details make sure to read the man page of +[dvc unprotect](/doc/commands-reference/unprotect). + +
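Whether the `cache.type` setting in this patch is needed depends on the file system of the external drive. One rough way to find out is to attempt a reflink copy on that drive. The sketch below is only an illustration under stated assumptions: it relies on GNU coreutils `cp` (which provides `--reflink`), `supports_reflink` is a hypothetical helper name and not a DVC command, and `DATA_PATH` falls back to the current directory when it is not set.

```bash
# Hypothetical helper (not part of DVC): probe a directory for reflink support.
# Returns 0 if a reflink copy succeeds, 1 if it fails, and 2 if the probe
# file cannot be created (missing directory or no write permission).
# Assumes GNU coreutils cp, which provides --reflink.
supports_reflink() {
    local dir="$1" src dst rc
    src=$(mktemp "$dir/reflink-probe.XXXXXX" 2>/dev/null) || return 2
    dst="$src.copy"
    echo "probe" > "$src"
    if cp --reflink=always "$src" "$dst" 2>/dev/null; then
        rc=0
    else
        rc=1
    fi
    # Clean up the probe files in both cases.
    rm -f "$src" "$dst"
    return "$rc"
}

# DATA_PATH is the variable suggested in this use case; default to "." here.
target="${DATA_PATH:-.}"
rc=0
supports_reflink "$target" || rc=$?
case "$rc" in
    0) echo "reflinks available on $target" ;;
    1) echo "no reflinks on $target; consider symlink or hardlink cache types" ;;
    *) echo "cannot write to $target" ;;
esac
```

If the probe reports that reflinks are unavailable, the `dvc config cache.type` and `dvc config cache.protected` commands from the patch are the relevant next step.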
From 50e5bcb1cf48bd94e35f232564437301c6773ee4 Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Mon, 19 Aug 2019 19:40:57 +0200 Subject: [PATCH 04/11] Rename the file; add it to sidebar.json --- src/Documentation/sidebar.json | 4 ++++ ...-drive.md => data-storage-on-external-drive.md} | 14 +++++++------- 2 files changed, 11 insertions(+), 7 deletions(-) rename static/docs/use-cases/{huge-data-on-an-external-drive.md => data-storage-on-external-drive.md} (92%) diff --git a/src/Documentation/sidebar.json b/src/Documentation/sidebar.json index a44849b305..ae02fda62f 100644 --- a/src/Documentation/sidebar.json +++ b/src/Documentation/sidebar.json @@ -46,6 +46,10 @@ "label": "Share Data & Model Files", "slug": "share-data-and-model-files" }, + { + "label": "Data Storage On External Drive", + "slug": "data-storage-on-external-drive" + }, { "label": "Shared Development Machine", "slug": "multiple-data-scientists-on-a-single-machine" diff --git a/static/docs/use-cases/huge-data-on-an-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md similarity index 92% rename from static/docs/use-cases/huge-data-on-an-external-drive.md rename to static/docs/use-cases/data-storage-on-external-drive.md index 7f05dec3a3..e5cdbb1faa 100644 --- a/static/docs/use-cases/huge-data-on-an-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -1,12 +1,12 @@ -# Managing Data Storage On An External Hard Drive +# Data Storage On External Hard Drive -Sometimes the data may be huge and stored on an +Sometimes the data may be stored on an [external hard drive](https://whatis.techtarget.com/definition/external-hard-drive). -By "huge" we mean that they won't fit on our home directory, and even if they -did, it would certainly take a long time to copy them back an forth from the -external drive to our home directory. 
For example let's say that the data are -stored on an external hard drive of size 16TB, while the hard drive of our home -directory has a size of only 320GB. +Usually such data are huge, which means that they won't fit on our home +directory, and even if they did, it would certainly take a long time to copy +them back and forth from the external drive to the internal one. For example +let's say that the size of the external drive is 16TB, while the hard drive of +our home directory is only 320GB. In this case we would like to process the data where they are located (on the external drive). We also would like to save the results there, and certainly to From 9a999a7b05b2293fa6d3781e19662eb2488b8fca Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Tue, 20 Aug 2019 17:28:37 +0200 Subject: [PATCH 05/11] Corrections --- static/docs/use-cases/data-storage-on-external-drive.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index e5cdbb1faa..fe69198a3f 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -1,4 +1,4 @@ -# Data Storage On External Hard Drive +# Data Storage on External Drive Sometimes the data may be stored on an [external hard drive](https://whatis.techtarget.com/definition/external-hard-drive). @@ -92,8 +92,8 @@ Now, when you refer to the data files and directories, you have to use their **absolute path**. The DVC-files will be created on the project directory, and you can track their modifications with `git` as usual. -For example let's say that the raw data file are on `/mnt/data/raw/` and you are -cleaning them up. You could do it like this: +For example let's say that the raw data files are on `/mnt/data/raw/` and you +are cleaning them up. 
You could do it like this: ```dvc $ dvc add /mnt/data/raw @@ -174,7 +174,7 @@ Now you can add and commit the DVC-files to git: ```dvc $ git add raw.dvc clean.dvc -$ git commit -m "cleanup raw data" +$ git commit -m "Cleanup raw data" ```
From 059042863f7555414b6526483731c8b110a9382c Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Tue, 20 Aug 2019 17:58:13 +0200 Subject: [PATCH 06/11] Add a section for similar cases --- .../docs/use-cases/data-storage-on-external-drive.md | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index fe69198a3f..3e9f766668 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -205,3 +205,14 @@ modifying them. For more details make sure to read the man page of [dvc unprotect](/doc/commands-reference/unprotect).
+ +## Similar cases + +If instead of an external drive we have a +[network-attached storage(NAS)](https://searchstorage.techtarget.com/definition/network-attached-storage) +mounted on the directory `/mnt/data/` (through NFS, Samba, etc.), the solution +would be the same. + +However, in this case the data are most probably used by a team of people, so +make sure to check also the case of +[Shared Development Server](/doc/use-cases/multiple-data-scientists-on-a-single-machine). From a71a0f073d57306e489b387c27fa54f710d0b97b Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Thu, 22 Aug 2019 23:09:45 +0200 Subject: [PATCH 07/11] Make data singular --- .../data-storage-on-external-drive.md | 20 +++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index 3e9f766668..83edd880bf 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -2,18 +2,18 @@ Sometimes the data may be stored on an [external hard drive](https://whatis.techtarget.com/definition/external-hard-drive). -Usually such data are huge, which means that they won't fit on our home -directory, and even if they did, it would certainly take a long time to copy -them back and forth from the external drive to the internal one. For example -let's say that the size of the external drive is 16TB, while the hard drive of -our home directory is only 320GB. +Usually such data is huge, which means that it won't fit on our home directory, +and even if it did, it would certainly take a long time to copy it back and +forth from the external drive to the internal one. For example let's say that +the size of the external drive is 16TB, while the hard drive of our home +directory is only 320GB. 
-In this case we would like to process the data where they are located (on the +In this case we would like to process the data where it is located (on the external drive). We also would like to save the results there, and certainly to store the cached files there as well. -The most easy way to do this would be to locate the workspace on -the external drive itself. If we assume that the external drive is mounted on +The most easy way to do this would be to initialize the workspace +on the external drive itself. If we assume that the external drive is mounted on `/mnt/data/`, then it could be done like this: ```dvc @@ -66,7 +66,7 @@ $ git commit -m 'Initialize DVC with external cache' ### Transfer the content of the cache to the external directory -In this example we are just removing the default cache directory `.dvc/cache/` +In this example we are removing the default cache directory `.dvc/cache/` because we just initialized the project and we know that it is empty (there's nothing stored in it). If we had an existing project, we could preserve the content of the cache by moving it to the new directory: @@ -213,6 +213,6 @@ If instead of an external drive we have a mounted on the directory `/mnt/data/` (through NFS, Samba, etc.), the solution would be the same. -However, in this case the data are most probably used by a team of people, so +However, in this case the data is most probably used by a team of people, so make sure to check also the case of [Shared Development Server](/doc/use-cases/multiple-data-scientists-on-a-single-machine). 
From 482e1c7df1c2dab5379394e9127e617c82c023fb Mon Sep 17 00:00:00 2001 From: Dashamir Hoxha Date: Thu, 22 Aug 2019 23:16:29 +0200 Subject: [PATCH 08/11] Replace everywhere '/mnt/data' with '/mnt/external-drive' --- .../data-storage-on-external-drive.md | 51 ++++++++++--------- 1 file changed, 26 insertions(+), 25 deletions(-) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index 83edd880bf..dab6b9a5b7 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -18,7 +18,7 @@ on the external drive itself. If we assume that the external drive is mounted on ```dvc $ sudo su -# cd /mnt/data/ +# cd /mnt/external-drive/ # git init # dvc init ``` @@ -30,21 +30,21 @@ staying on the external drive. DVC will still be able to track them properly. ## Make the data directory accessible For this to work, first you have to make sure that you can read and write the -data directory `/mnt/data/`. The most straightforward way to do this is by -setting proper ownership and permissions to it, like this: +data directory `/mnt/external-drive/`. The most straightforward way to do this +is by setting proper ownership and permissions to it, like this: ```dvc -$ sudo chown : -R /mnt/data/ -$ chmod u+rw -R /mnt/data/ +$ sudo chown : -R /mnt/external-drive/ +$ chmod u+rw -R /mnt/external-drive/ ``` ## Start a DVC project and setup an external cache An _external_ cache is called so because it resides outside of the -workspace directory. Let's create a directory for it on `/mnt/data/`: +workspace directory. 
Let's create a directory for it on `/mnt/external-drive/`: ```dvc -$ mkdir -p /mnt/data/dvc-cache +$ mkdir -p /mnt/external-drive/dvc-cache ``` Now you can initialize a project on your home directory and @@ -55,7 +55,7 @@ $ cd ~/project/ $ git init $ dvc init -$ dvc config cache.dir /mnt/data/dvc-cache +$ dvc config cache.dir /mnt/external-drive/dvc-cache $ rm -rf .dvc/cache/ $ git add .dvc/config @@ -72,7 +72,7 @@ nothing stored in it). If we had an existing project, we could preserve the content of the cache by moving it to the new directory: ```dvc -$ mv .dvc/cache/* /mnt/data/dvc-cache/ +$ mv .dvc/cache/* /mnt/external-drive/dvc-cache/ $ rm -rf .dvc/cache/ ``` @@ -83,7 +83,7 @@ If you check the config file you should see something like this: ```dvc $ cat .dvc/config [cache] -dir = /mnt/data/dvc-cache +dir = /mnt/external-drive/dvc-cache ``` ## Tracking external dependencies and outputs @@ -92,16 +92,16 @@ Now, when you refer to the data files and directories, you have to use their **absolute path**. The DVC-files will be created on the project directory, and you can track their modifications with `git` as usual. -For example let's say that the raw data files are on `/mnt/data/raw/` and you -are cleaning them up. You could do it like this: +For example let's say that the raw data files are on `/mnt/external-drive/raw/` +and you are cleaning them up. You could do it like this: ```dvc -$ dvc add /mnt/data/raw +$ dvc add /mnt/external-drive/raw $ dvc run -f clean.dvc \ - -d /mnt/data/raw \ - -o /mnt/data/clean \ - ./cleanup.py /mnt/data/raw /mnt/data/clean + -d /mnt/external-drive/raw \ + -o /mnt/external-drive/clean \ + ./cleanup.py /mnt/external-drive/raw /mnt/external-drive/clean ```
@@ -109,7 +109,8 @@ $ dvc run -f clean.dvc \ ### Using an environment variable for the data path In a real life situation probably you would declare an environment variable -`DATA_PATH=/mnt/data` and use it to shorten the command options, like this: +`DATA_PATH=/mnt/external-drive` and use it to shorten the command options, like +this: ```dvc $ dvc add $DATA_PATH/raw @@ -134,7 +135,7 @@ md5: 9cbbacd47133debf91dcb41891c64730 wdir: . outs: - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/data/raw + path: /mnt/external-drive/raw cache: true metric: false persist: false @@ -146,14 +147,14 @@ $ cat clean.dvc ```yaml md5: 2b842ed58b1792dde6df27e3d0f73430 -cmd: cp -a /mnt/data/raw /mnt/data/clean +cmd: cp -a /mnt/external-drive/raw /mnt/external-drive/clean wdir: . deps: - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/data/raw + path: /mnt/external-drive/raw outs: - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/data/clean + path: /mnt/external-drive/clean cache: true metric: false persist: false @@ -163,10 +164,10 @@ You can also check and verify that indeed all the data and cache files are stored on the external drive: ```dvc -$ ls /mnt/data/ +$ ls /mnt/external-drive/ clean dvc-cache raw -$ ls /mnt/data/dvc-cache +$ ls /mnt/external-drive/dvc-cache ... ``` @@ -210,8 +211,8 @@ modifying them. For more details make sure to read the man page of If instead of an external drive we have a [network-attached storage(NAS)](https://searchstorage.techtarget.com/definition/network-attached-storage) -mounted on the directory `/mnt/data/` (through NFS, Samba, etc.), the solution -would be the same. +mounted on the directory `/mnt/external-drive/` (through NFS, Samba, etc.), the +solution would be the same. 
However, in this case the data is most probably used by a team of people, so make sure to check also the case of From abd24462c4b67e1e207a6b8a8ccf986fca81d502 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 20:08:36 -0500 Subject: [PATCH 09/11] use-cases: addressing all my own feedback in #565 for new data-storage-on-external-drive case --- .../data-storage-on-external-drive.md | 43 ++++++++++--------- 1 file changed, 23 insertions(+), 20 deletions(-) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index dab6b9a5b7..3416bb6f39 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -2,18 +2,17 @@ Sometimes the data may be stored on an [external hard drive](https://whatis.techtarget.com/definition/external-hard-drive). -Usually such data is huge, which means that it won't fit on our home directory, -and even if it did, it would certainly take a long time to copy it back and -forth from the external drive to the internal one. For example let's say that -the size of the external drive is 16TB, while the hard drive of our home -directory is only 320GB. - -In this case we would like to process the data where it is located (on the -external drive). We also would like to save the results there, and certainly to -store the cached files there as well. - -The most easy way to do this would be to initialize the workspace -on the external drive itself. If we assume that the external drive is mounted on +Usually such data is huge, which means that it won't fit on our local drive, and +even if it did, it would certainly take a long time to copy it back and forth +from the external drive to the internal one. For example let's say that the size +of the external drive is 16TB, while the local drive is only 320GB. 
+ +In this case we would like to process the data where it is already located (on +the external drive). We also would like to save the results there, and certainly +to store the cached files there as well. + +The easiest way to do this would be to initialize the workspace on +the external drive itself. If we assume that the external drive is mounted on `/mnt/data/`, then it could be done like this: ```dvc @@ -24,8 +23,8 @@ $ sudo su ``` But in case this is not possible (or is not preferable), we can easily setup the -workspace in our home directory, while all the data files and their caches keep -staying on the external drive. DVC will still be able to track them properly. +workspace in our local drive, while all the data files and their caches stay on +the external drive. DVC will still be able to track them properly. ## Make the data directory accessible @@ -38,10 +37,15 @@ $ sudo chown : -R /mnt/external-drive/ $ chmod u+rw -R /mnt/external-drive/ ``` +> Or refer to +> [User Account Control](https://docs.microsoft.com/en-us/windows/security/identity-protection/user-account-control/user-account-control-overview) +> for Windows. + ## Start a DVC project and setup an external cache -An _external_ cache is called so because it resides outside of the -workspace directory. Let's create a directory for it on `/mnt/external-drive/`: +An [external cache](/doc/user-guide/external-outputs) is called so because it +resides outside of the workspace directory. 
Let's create a directory for it on +`/mnt/external-drive/`: ```dvc $ mkdir -p /mnt/external-drive/dvc-cache @@ -56,7 +60,6 @@ $ git init $ dvc init $ dvc config cache.dir /mnt/external-drive/dvc-cache -$ rm -rf .dvc/cache/ $ git add .dvc/config $ git commit -m 'Initialize DVC with external cache' @@ -69,7 +72,7 @@ $ git commit -m 'Initialize DVC with external cache' In this example we are removing the default cache directory `.dvc/cache/` because we just initialized the project and we know that it is empty (there's nothing stored in it). If we had an existing project, we could preserve the -content of the cache by moving it to the new directory: +content of the cache by moving it to the new directory: ```dvc $ mv .dvc/cache/* /mnt/external-drive/dvc-cache/ @@ -174,8 +177,8 @@ $ ls /mnt/external-drive/dvc-cache Now you can add and commit the DVC-files to git: ```dvc -$ git add raw.dvc clean.dvc -$ git commit -m "Cleanup raw data" +$ git add . +$ git commit -m 'Cleanup raw data' ```
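For an existing project, the cache transfer can also be done more defensively: copy first, verify the copy, and only then delete the original. The following is a minimal sketch under stated assumptions. `migrate_cache` is a hypothetical helper (not a DVC command), and the destination directory is assumed to be new or empty, since otherwise `diff -r` would report the extra files and the check would fail.

```bash
# Hypothetical helper (not part of DVC): copy an existing cache directory to
# a new location, verify the copy, and only then remove the original.
# Assumes the destination is new or empty, so that diff -r is a valid check.
migrate_cache() {
    local src="$1" dst="$2"
    [ -d "$src" ] || { echo "no cache directory at $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    # "$src/." copies the directory contents, hidden entries included;
    # -a preserves permissions and timestamps.
    cp -a "$src/." "$dst/" || return 1
    # Compare both trees before deleting anything.
    diff -r "$src" "$dst" >/dev/null || {
        echo "copy verification failed; $src left untouched" >&2
        return 1
    }
    rm -rf "$src"
}

# Example with the paths from this use case (run from the project root):
#   migrate_cache .dvc/cache /mnt/external-drive/dvc-cache
```

Copying across drives takes roughly as long as moving, but the original cache stays intact if the external drive fills up or is disconnected halfway through.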
From b3d5c6b985e74917e2e8d517e5e02c8828f7b6f6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 20:12:18 -0500 Subject: [PATCH 10/11] use-cases: improve DVC-file explanation per https://github.com/iterative/dvc.org/pull/565#pullrequestreview-304189595 --- static/docs/use-cases/data-storage-on-external-drive.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index 3416bb6f39..066a4d68a4 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -126,8 +126,8 @@ $ dvc run -f clean.dvc \
-If you check the contents of the files `raw.dvc` and `clean.dvc` you will notice -that their `path:` field refers to the external directories: +If you check the contents of `raw.dvc` (and `clean.dvc`) you'll notice that the +`path` field refers to the external directories: ```dvc $ cat raw.dvc From f345245f4285f64d65c7e7fc8f27436b27cabef4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Oct 2019 20:14:03 -0500 Subject: [PATCH 11/11] use-cases: remove unnecssary code blocks also per https://github.com/iterative/dvc.org/pull/565#pullrequestreview-304189595 --- .../data-storage-on-external-drive.md | 23 ------------------- 1 file changed, 23 deletions(-) diff --git a/static/docs/use-cases/data-storage-on-external-drive.md b/static/docs/use-cases/data-storage-on-external-drive.md index 066a4d68a4..32d40ec2c7 100644 --- a/static/docs/use-cases/data-storage-on-external-drive.md +++ b/static/docs/use-cases/data-storage-on-external-drive.md @@ -129,10 +129,6 @@ $ dvc run -f clean.dvc \ If you check the contents of `raw.dvc` (and `clean.dvc`) you'll notice that the `path` field refers to the external directories: -```dvc -$ cat raw.dvc -``` - ```yaml md5: 9cbbacd47133debf91dcb41891c64730 wdir: . @@ -144,25 +140,6 @@ outs: persist: false ``` -```dvc -$ cat clean.dvc -``` - -```yaml -md5: 2b842ed58b1792dde6df27e3d0f73430 -cmd: cp -a /mnt/external-drive/raw /mnt/external-drive/clean -wdir: . -deps: - - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/external-drive/raw -outs: - - md5: 0ee0a6bc0a1f1be0610f7a3f67f1cb54.dir - path: /mnt/external-drive/clean - cache: true - metric: false - persist: false -``` - You can also check and verify that indeed all the data and cache files are stored on the external drive: