I am dealing with a large hierarchical data set, one where artifacts are pulled from various directories to generate contiguous data sets that are then fed to ML processes downstream. I don't want to use dvc to reproduce the pipeline, at least not yet. Rather, I need to version the overall image dataset hierarchy so that the whole hierarchy can be inspected manually and images can be moved into groups or removed altogether when necessary.
This enables folks with less ML expertise to control the data set they want to build by grouping together the content they want picked up when the data set is generated. The data set is not a list of images; rather, it is a list of lower-dimensional feature vectors extracted from those images.
I'm finding that dvc takes a potentially unreasonable amount of time just to add and commit. Perhaps I don't understand what I'm doing or haven't set my expectations correctly.
I wanted to keep these operations small to make sure things were working well. I have approximately 300K images in total in this set right now. So far I have done the following:
- Store 60K images on the local file system, under the data/ directory.
- dvc add data/
- dvc push -r remote. I forgot to commit here, since things took so long and I wanted to see whether pushing worked.
- Store 120K additional images in another subdirectory under the data/ directory.
- dvc add data/. This goes through all of the files in data/ again, including the ones added earlier. I ran it with -v, and the output showed the previous files.
- dvc push -r remote.
- dvc commit. Here dvc is taking about 99% of system memory (13 GB) and appears to be causing disk thrashing. It has been running for nearly a day so far.
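To put numbers on the steps above, a minimal sketch like the following could report how many files and how many bytes sit under each top-level subdirectory of data/. It only walks the file system and does not call dvc; the function name summarize_tree and the data path in the usage section are hypothetical placeholders, not part of the setup described here.

```python
import os
from collections import Counter


def summarize_tree(root):
    """Return {top-level subdirectory: (file_count, total_bytes)} for root.

    Files that sit directly in root are reported under the key ".".
    """
    counts = Counter()
    sizes = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Group everything by the first path component below root.
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                # Skip files that vanish or cannot be read (e.g. broken links).
                continue
            counts[top] += 1
            sizes[top] += size
    return {top: (counts[top], sizes[top]) for top in counts}


if __name__ == "__main__":
    # "data" is an assumed path; point it at the directory given to dvc add.
    for top, (n_files, n_bytes) in sorted(summarize_tree("data").items()):
        print(f"{top}: {n_files} files, {n_bytes / 1e9:.2f} GB")
```

Including this per-subdirectory breakdown in the report would make it easier to judge whether the add and commit times are in line with the amount of data involved.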
I am just looking for some guidance on managing a dataset of this nature with dvc in a way that will not eat up so much time, disk, compute, etc. If I'm doing something suboptimal, I'd like to shine some light on that.