DVC fails adding a second time when the cache dir is external #7601

@Erotemic

Description

DVC will sometimes fail when adding a file it is already tracking.

This happens in the case where:

  1. the cache dir is external, i.e. outside of the DVC repo
  2. the cache type is symlink

Adding a file to DVC fails the second time it is added. After the first add, the file becomes a symlink into the external cache dir, and DVC then thinks that the file does not belong to the DVC repo.
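One way to observe this from the shell is to check where the tracked file resolves to after the first add. The helper below is a hypothetical sketch (the name resolves_outside is not part of DVC) and assumes a readlink that supports -f, as GNU coreutils does. It only shows that the symlink target lies outside the repo directory; whether DVC's own check works exactly this way is an assumption.

```shell
# Hypothetical helper, not part of DVC.
# Exit status 0 if PATH resolves to a location outside ROOT,
# 1 if it resolves inside ROOT, 2 if either path cannot be resolved.
resolves_outside() {
    target=$(readlink -f -- "$1") || return 2
    root=$(readlink -f -- "$2") || return 2
    case "$target" in
        "$root"|"$root"/*) return 1 ;;  # still inside the repo
        *) return 0 ;;                   # points somewhere else, e.g. the external cache
    esac
}

# Example, run from the repo root after the first add:
#   resolves_outside eval/expt1/params1/summary.json . && echo "resolves outside the repo"
```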

Reproduce

BASE_DPATH=$HOME/tmp/dvc_add_issue

# Create a clean start directory
rm -rf "$BASE_DPATH"
mkdir -p "$BASE_DPATH"
cd "$BASE_DPATH"

mkdir -p "$BASE_DPATH"/external_cache

# Make a simple repo
mkdir -p "$BASE_DPATH/demo_repo"
cd "$BASE_DPATH/demo_repo"
git init --quiet
dvc init --quiet
dvc config cache.type "symlink"

# Set the cache to an external dir
dvc cache dir "$BASE_DPATH"/external_cache

mkdir -p eval/expt1/params1
echo '["data"]' > eval/expt1/params1/summary.json
dvc add eval/*/*/summary.json
ls -al eval/*/*/summary.json

# Adding again will fail if we have an external DVC cache
dvc add eval/*/*/summary.json

This results in

ERROR: Output(s) outside of DVC project: eval/expt1/params1/summary.json. See <https://dvc.org/doc/user-guide/managing-external-data> for more info.

Expected

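If we comment out the line that sets the cache dir (the dvc cache dir command), the previous MWE works: the second add succeeds. The same should happen when the cache dir is external.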

Output of dvc doctor:

DVC version: 2.9.5 (pip)
---------------------------------
Platform: Python 3.9.9 on Linux-5.13.0-30-generic-x86_64-with-glibc2.31
Supports:
	azure (adlfs = 2021.10.0, knack = 0.9.0, azure-identity = 1.7.1),
	gdrive (pydrive2 = 1.10.0),
	gs (gcsfs = 2021.11.1),
	hdfs (fsspec = 2021.11.1, pyarrow = 6.0.1),
	webhdfs (fsspec = 2021.11.1),
	http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
	s3 (s3fs = 2021.11.1, boto3 = 1.19.8),
	ssh (sshfs = 2021.11.2),
	oss (ossfs = 2021.8.0),
	webdav (webdav4 = 0.9.3),
	webdavs (webdav4 = 0.9.3)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/md127
Caches: local
Remotes: ssh, s3, ssh
Workspace directory: ext4 on /dev/md127
Repo: dvc, git

Additional Info

This is a pain because I have a bunch of files that I track individually, and I use a glob pattern to add them all. With an external cache dir I have to use a workaround: I pass the glob string to a Python program that filters out the paths that are already symlinks and then passes the remaining files to dvc add.
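The workaround can be sketched in shell rather than Python. This is not the program mentioned above, only a minimal illustration of the same idea; the function name non_symlinks is hypothetical. It assumes that any symlink matched by the glob is a file DVC already tracks, which will not hold if the workspace contains other symlinks.

```shell
# Hypothetical helper: print each argument that exists and is not a symlink,
# one path per line. Symlinks are assumed to be files DVC already tracks.
non_symlinks() {
    for path in "$@"; do
        if [ -e "$path" ] && [ ! -L "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# Example: add only the files that have not been replaced by cache symlinks.
# Paths containing newlines are not handled.
#   non_symlinks eval/*/*/summary.json | while IFS= read -r f; do dvc add "$f"; done
```

This calls dvc add once per file, which is slower than a single invocation but keeps the sketch free of shell-specific array syntax.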

Labels: A: data-management, bug