A 1.7 GB dataset containing 600k files takes many hours with dvc add or dvc push/pull.
Description
Hi all, we have been using dvc for a while on medium-sized datasets, but struggle when trying to use it with big ones. We are unsure if it is due to our poor use of the tool or if it is a real bug.
We have a dataset containing about 600k small files for a total of 1.7 GB. The repo is configured to be stored on s3. Uploading the whole dataset with aws s3 cp takes only a few minutes.
We had two problems and are unsure if they are DVC problems or if they are due to us misusing the tool:
- If we run dvc add /path/to/dataset, it takes a few tens of minutes, which is okay I think, as every md5 must be computed. After that, dvc status is fast. However, if we change a single file, all md5s are recomputed and just checking the status takes ages again.
  We partly worked around this: since the dataset is composed of many smaller subfolders, we add them with dvc add /path/to/dataset/*/* instead. Then everything is fine, but this seems quite odd. Is there a better way to do it?
- Once we are able to add, the next problem is dvc push/pull. This takes between 6 and 8 hours, which seems too much for less than 2 GB. It seems that dvc is uploading every file separately? Are we doing something wrong?
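To put the push/pull numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes exactly 600,000 files and 1.7 GB in decimal units, both of which are only approximations of the figures above.

```python
# Rough estimate only: the file count and total size are approximate.
NUM_FILES = 600_000
TOTAL_BYTES = 1.7e9  # 1.7 GB, decimal units (assumption)


def avg_file_size_bytes(total_bytes=TOTAL_BYTES, num_files=NUM_FILES):
    """Average size of one file in bytes."""
    return total_bytes / num_files


def files_per_second(hours, num_files=NUM_FILES):
    """Average number of files transferred per second over the given duration."""
    return num_files / (hours * 3600)


print(f"average file size: {avg_file_size_bytes():.0f} bytes")
for hours in (6, 8):
    print(f"{hours} h push -> {files_per_second(hours):.1f} files/s")
```

That works out to files averaging under 3 kB, moved at a few tens of files per second, which suggests the time goes into per-file overhead rather than bandwidth.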
Reproduce
I cannot share my own dataset, but it weighs 1.7 GB and contains 600k files split into subfolders, which themselves have subfolders.
dvc remote add -d origin path/to/s3/dir
dvc add /path/to/dataset/*/*
git commit -am 'test'
dvc push
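Since the real dataset cannot be shared, a synthetic stand-in with a similar shape might help to reproduce this. Below is a minimal sketch; the function name, directory and file names, and the default sizes are all hypothetical, and the random content is only an approximation of the real data (about 2.8 kB per file on average).

```python
import os
from pathlib import Path


def make_dataset(root, n_top=10, n_sub=10, files_per_dir=60, file_size=2800):
    """Create n_top * n_sub nested folders, each holding files_per_dir small files.

    Returns the number of files written. All names here are illustrative only.
    """
    root = Path(root)
    count = 0
    for i in range(n_top):
        for j in range(n_sub):
            folder = root / f"group_{i:03d}" / f"sub_{j:03d}"
            folder.mkdir(parents=True, exist_ok=True)
            for k in range(files_per_dir):
                # Random bytes so that the files do not share identical content.
                (folder / f"file_{k:05d}.bin").write_bytes(os.urandom(file_size))
                count += 1
    return count


# Example (defaults write 6,000 files; the call below writes 600,000):
# make_dataset("synthetic_dataset", n_top=100, n_sub=100, files_per_dir=60)
```

With n_top=100, n_sub=100 and files_per_dir=60, this writes 600,000 files of 2,800 bytes each, about 1.7 GB in total, which can then be used in place of /path/to/dataset in the commands above.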
Expected
Uploading the files to s3 should be considerably faster, I think?
Environment information
Output of dvc doctor:
DVC version: 2.9.5 (deb)
Platform: Python 3.8.3 on Linux-5.4.0-107-generic-x86_64-with-glibc2.14
Supports:
azure (adlfs = 2022.2.0, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.0),
gs (gcsfs = 2022.1.0),
hdfs (fsspec = 2022.1.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.1.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.1.0, boto3 = 1.20.24),
ssh (sshfs = 2021.11.2),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.4),
webdavs (webdav4 = 0.9.4)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Additional Information (if any):