I've been working on a large project with multiple datasets. One of these datasets is very large (>100 GB). If I simply run `dvc pull`, it downloads the huge dataset too, which takes up most of the available disk space on my machine.
The only way around this appears to be passing the name of every data file I want to download. This is inconvenient, because there are many files I do want and only one that I don't.
I see two solutions to this:
- Allow named file groups. The user could define groups of files in some sort of config and pull them individually by name, e.g., `dvc pull mnist`. The user would also be able to exclude a group: `dvc pull all --exclude mnist`.
- Allow exclusion of certain files from the command line, e.g., `dvc pull --exclude data/mnist.dvc`.
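
For illustration, the exclusion idea can be approximated outside DVC today. This is a minimal sketch in Python, assuming every dataset is tracked by its own `.dvc` file; `select_targets` and `pull_except` are hypothetical helper names, not part of DVC. It relies only on `dvc pull` accepting explicit targets, and it does not handle outputs tracked through `dvc.yaml`.

```python
import subprocess
from fnmatch import fnmatch
from pathlib import Path


def select_targets(repo_root, exclude):
    """Return relative paths of all .dvc files that match none of the exclude patterns."""
    root = Path(repo_root)
    targets = []
    for path in sorted(root.rglob("*.dvc")):
        rel = path.relative_to(root)
        # Skip the internal .dvc directory itself and anything inside it.
        if not path.is_file() or ".dvc" in rel.parts[:-1]:
            continue
        rel_posix = rel.as_posix()
        # Patterns use shell-style wildcards, e.g. "data/mnist.dvc" or "data/*.dvc".
        if any(fnmatch(rel_posix, pattern) for pattern in exclude):
            continue
        targets.append(rel_posix)
    return targets


def pull_except(repo_root, exclude):
    """Run `dvc pull` on every .dvc file except the excluded ones."""
    targets = select_targets(repo_root, exclude)
    if targets:
        # Equivalent to typing: dvc pull <target1> <target2> ...
        subprocess.run(["dvc", "pull", *targets], cwd=repo_root, check=True)
    return targets
```

Calling `pull_except(".", ["data/mnist.dvc"])` from the repository root would then pull everything except the one large dataset. It works, but it is exactly the kind of wrapper I'd rather not have to maintain.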