This is more of a question, related to setting up data registries and the implications of a shared cache when using dvc import.
Presently I have a few datasets, each created as a separate git/dvc project (each in roughly the 1000 GB range).
Each dataset contains a group of specific images, along with several different annotation types.
Each dataset has been configured to use a separate (independent) shared cache on network attached storage, visible to several shared development servers:
/network/storage/shared_dvc/cache/project_A
/network/storage/shared_dvc/cache/project_B
/network/storage/shared_dvc/cache/project_C
This part is working.
Now the question arises from consuming these registries with a 4th project (project_D). This project contains the code defining a DL network and a training script. The network consumes a composite of information contained in registries project_B and project_C (accomplished with dvc import).
It would seem unnecessary to duplicate the cache storage.
- Is there a way to share the existing caches for project_B and project_C?
- Should all these independent DVC/git projects be configured to use the same cache dir?
- Do we set up a shared cache for project_D, which would then have its own independent shared cache/copy, duplicating a subset of project_B and project_C plus whatever we are tracking in D?
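
To make the second option concrete: as far as I understand, pointing every project at one common cache would mean a `.dvc/config` along these lines in each repo. This is only a sketch. The `common` directory is a hypothetical path I made up for illustration, and the `shared` and `type` settings are my assumption about what a multi-user cache on network storage would need.

```ini
# .dvc/config in project_B, project_C and project_D (sketch, not tested)
[cache]
    # hypothetical common cache location, replacing the per-project dirs
    dir = /network/storage/shared_dvc/cache/common
    # assumption: make cache files group-writable for the shared servers
    shared = group
    # assumption: link workspace files to the cache instead of copying them
    type = symlink
```

Since the cache is content-addressed, I would expect files imported into project_D to resolve to the same cache entries already stored for project_B and project_C, which is the deduplication I am hoping for.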
The datasets eat up storage fairly quickly, so I am looking for guidance on minimizing the impact of duplicate copies.