DVC can fail when adding a file it is already tracking.
This happens when both of the following hold:
- the cache dir is external, i.e. outside of the DVC repo
- the cache type is symlink
In that configuration, adding a file to DVC fails if it is added a second time. After the first add, the file in the workspace becomes a symlink into the external cache dir, and DVC seems to think that the file does not belong to the DVC repo.
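To illustrate the suspected cause, here is a minimal sketch in plain Python. It does not use DVC and is not DVC's actual check. It only shows that a path-containment test that follows symlinks reports a workspace file as outside the repo once that file is a symlink into an external directory. The helper names (is_inside, demo) and the file name abc123 are made up for the illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Tuple


def is_inside(path: Path, root: Path, follow_symlinks: bool) -> bool:
    """Return True if `path` lies under `root` (hypothetical helper)."""
    root = root.resolve()
    if follow_symlinks:
        # Follows the final symlink, so the result is the link's target.
        candidate = path.resolve()
    else:
        # Resolve only the parent directory; keep the last component as-is.
        candidate = path.parent.resolve() / path.name
    return candidate == root or root in candidate.parents


def demo() -> Tuple[bool, bool]:
    """Mimic the layout from the report: a repo plus an external cache."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        repo = base / "demo_repo"
        cache = base / "external_cache"
        repo.mkdir()
        cache.mkdir()

        # Stand-in for a cached object; the name is arbitrary.
        cached = cache / "abc123"
        cached.write_text('["data"]')

        # Stand-in for the workspace file after the first add.
        link = repo / "summary.json"
        os.symlink(cached, link)

        return (
            is_inside(link, repo, follow_symlinks=True),
            is_inside(link, repo, follow_symlinks=False),
        )


if __name__ == "__main__":
    print(demo())
```

With symlinks followed, the file looks like it lives in external_cache. Without following the final link, it is still inside demo_repo. Whether DVC performs a check of this shape is an assumption on my part.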
Reproduce
BASE_DPATH=$HOME/tmp/dvc_add_issue
# Create a clean start directory
rm -rf "$BASE_DPATH"
mkdir -p "$BASE_DPATH"
cd "$BASE_DPATH"
mkdir -p "$BASE_DPATH"/external_cache
# Make a simple repo
mkdir -p "$BASE_DPATH/demo_repo"
cd "$BASE_DPATH/demo_repo"
git init --quiet
dvc init --quiet
dvc config cache.type "symlink"
# Set the cache to an external dir
dvc cache dir "$BASE_DPATH"/external_cache
mkdir -p eval/expt1/params1
echo '["data"]' > eval/expt1/params1/summary.json
dvc add eval/*/*/summary.json
ls -al eval/*/*/summary.json
# Adding again will fail if we have an external DVC cache
dvc add eval/*/*/summary.json
This results in
ERROR: Output(s) outside of DVC project: eval/expt1/params1/summary.json. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.
Expected
If we comment out the line that sets the cache dir, the MWE above works: the second dvc add succeeds. I would expect the same behavior when the cache dir is external.
Output of dvc doctor:
DVC version: 2.9.5 (pip)
---------------------------------
Platform: Python 3.9.9 on Linux-5.13.0-30-generic-x86_64-with-glibc2.31
Supports:
azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
gdrive (pydrive2 = 1.10.0),
gs (gcsfs = 2021.11.1),
hdfs (fsspec = 2021.11.1, pyarrow = 6.0.1),
webhdfs (fsspec = 2021.11.1),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2021.11.1, boto3 = 1.19.8),
ssh (sshfs = 2021.11.2),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.3),
webdavs (webdav4 = 0.9.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/md127
Caches: local
Remotes: ssh, s3, ssh
Workspace directory: ext4 on /dev/md127
Repo: dvc, git
Additional Info
This is a pain because I have a bunch of files that I'm tracking individually, and I use a glob pattern to add them all. When I have an external cache dir, I have to use a workaround: I pass the glob string to a Python program that filters out the symlinks and then sends the rest of the files I want to track to dvc add.
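A minimal sketch of that kind of workaround, not my exact script: it expands the glob, drops any match that is a symlink, and passes the remaining paths to dvc add in one call. The function names (untracked_targets, dvc_add) are made up here, and it assumes every symlink matched by the pattern is a file DVC already tracks, which may not hold in other repos.

```python
import glob
import os
import subprocess
import sys


def untracked_targets(pattern):
    """Expand a glob pattern and drop matches that are symlinks.

    Assumption: with cache.type=symlink, a symlink in the workspace is a
    file that DVC already tracks, so it can be skipped.
    """
    matches = sorted(glob.glob(pattern, recursive=True))
    return [path for path in matches if not os.path.islink(path)]


def dvc_add(pattern, dry_run=False):
    """Run `dvc add` on the non-symlink matches; return the list used."""
    targets = untracked_targets(pattern)
    if targets and not dry_run:
        # `dvc add` accepts multiple targets in a single invocation.
        subprocess.run(["dvc", "add", *targets], check=True)
    return targets


if __name__ == "__main__":
    # Quote the pattern in the shell so that Python expands it, e.g.:
    #   python add_new.py 'eval/*/*/summary.json'
    for path in dvc_add(sys.argv[1]):
        print(path)
```

The pattern has to be quoted on the command line. Otherwise the shell expands it first and the script only sees the first match.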