Decoupling dvcignore and fs#5812
Conversation
|
Currently affected files ( with at least one Can be classified into five levels
Originally
Currently the behavior of would throw an exception And in documents
For comparison, in Git Should we keep our current logic or switch to a more Git-like one? BTW, even if we decided to switch, it shouldn't be included in this PR. |
|
@efiop
|
|
@karajan1001 Great summary! Indeed, there is an inconsistency where dvcignore could ignore already tracked output, even though it really should behave like
Most of user-facing commands don't really walk into subrepos when walking the host repo, but when given a path that is in subrepo - will indeed go into the subrepo (it is a bit confusing, but it is really a more simple logic), while we do have some tests that check for recursive host walk that has to walk into subrepos as well. The latter one could be considered to be changed.
Hopefully most of such cases could be covered by just filtering out |
|
@efiop, @karajan1001, what are the reasons for the decoupling? IIRC we depend on this behaviour in a lot of places (maybe what we need is a better API but is decoupling/duplication worth it?). |
So, it is not a designed behavior but accidentally behaves like this.
|
|
@karajan1001 I mean that we can change the second part - behaviour that is only tested by us in syntetic tests but not actually used in any user-facing functionality. E.g. an ability to fs.walk("host_root") and go through host repo and into the subrepo could be considered to be dropped, as it is not used by any dvc commands. Have to point out that this is an open-ended task, if you see that the current approach works better - we can keep it. |
|
Current Originally Two solutions:
|
ea8262d to
6770760
Compare
Most of the tests' performance changed less than 4% after this PR. Only two of them show an obvious diff |
Extra time cost comes from two steps.
After solving these two problems. New version runs a bit faster than before. This might because we move subrepo finding from
|
|
Newest result. a bit better than before.
|
7758d1c to
798abbf
Compare
There was a problem hiding this comment.
walk_files should probably be files to avoid confusion with the function.
How about:
| if fs.scheme == Schemes.LOCAL and fs.repo: | |
| walk_files = fs.repo.dvcignore(fs.walk(path_info), walk_files=True) | |
| else: | |
| walk_files = fs.walk_files(path_info) | |
| existing_files = set(walk_files) | |
| files = set(cache.repo.dvcignore(fs, path_info, walk_files=True)) |
unlike fs.repo, cache.repo is more persistent (though temporarily) so we can rely on it more.
Or maybe even:
files = set(cache.repo.dvcignore.walk_files(fs, path_info))
so that dvcignore has walk and walk_files methods.
Also, for the record, I'm refactoring this right now to decouble repo from fs, and fs.repo will go away soon, as well as cache.repo. We'll likely start passing dvcignore(as well as state) to checkout as kwargs. No action needed in this PR, just FYI.
There was a problem hiding this comment.
dvc/objects/stage.py: walk_iterator = odb.repo.dvcignore.walk_files(fs.walk(path_info))
dvc/utils/fs.py: walk_iterator = dvcignore.walk_files(fs.walk(path))
dvc/checkout.py: cache.repo.dvcignore.walk_files(fs.walk(path_info))
dvc/fs/repo.py: yield from repo.dvcignore.walk(
dvc/repo/collect.py: target_infos.extend(repo.dvcignore.walk_files(fs.walk(path_info)))
dvc/repo/add.py: repo.dvcignore.walk_files(repo.fs.walk(target)),
dvc/repo/stage.py: for root, dirs, files in self.repo.dvcignore.walk(
Currently dvcignore are used in these 7 places.
4 of them could be derived directly from repo
dvc/repo/collect.py: target_infos.extend(repo.dvcignore.walk_files(fs.walk(path_info)))
dvc/repo/add.py: repo.dvcignore.walk_files(repo.fs.walk(target)),
dvc/repo/stage.py: for root, dirs, files in self.repo.dvcignore.walk(
dvc/fs/repo.py: yield from repo.dvcignore.walk(
2 come from cache ( temporary solution )
dvc/objects/stage.py: walk_iterator = odb.repo.dvcignore.walk_files(fs.walk(path_info))
dvc/checkout.py: cache.repo.dvcignore.walk_files(fs.walk(path_info))
And I add a repo variable to the class State in the last one.
dvc/utils/fs.py: walk_iterator = dvcignore.walk_files(fs.walk(path))
There was a problem hiding this comment.
In my opinion, maybe we should pass a list of individual files instead of path and file systems other things in the last three cases.
There was a problem hiding this comment.
In my opinion, maybe we should pass a list of individual files instead of path and file systems other things in the last three cases.
Could you elaborate, please?
| jobs=jobs, | ||
| follow_subrepos=False, | ||
| ) | ||
| obj = stage(odb, path_info, repo.repo_fs, "md5", jobs=jobs,) |
There was a problem hiding this comment.
There's an interesting question of whether we should apply dvcignore here instead of relying on it being applied in repo_fs 🤔 For now we could keep it as is, but it might be a good idea for the future.
There was a problem hiding this comment.
@efiop add dvcignore here would cause dvc fetch only downloading those not ignored caches. The problem would be with those caches inside subrepos. Subrepo problems are not solved thoroughly. Maybe it is better to solve it after we had fixed and clear subrepo documents and usage.
There was a problem hiding this comment.
| def get_mtime_and_size(path, fs, dvcignore=None): | ||
| import nanotime | ||
|
|
||
| if fs.isdir(path): |
There was a problem hiding this comment.
This should probably also use dvcignore, right?
There was a problem hiding this comment.
Yes, without it, dvc tracked directories would be regarded as modified after dvcignored changing.
There was a problem hiding this comment.
So should it check that the path is not dvcignored first?
On a related topic, right now we pass dvcignore to the state constructor so we could pass it through to get_mtime_and_size, which has an implication of us always using dvcignore even though we might not want to. So maybe it could be a good idea to pass dvcignore explicitly to State methods that might need it.
State is a bit behind right now on fsspec changes and still uses os.path instead of fs (same as get_mtime_and_size), but that should change in the near future. Thinking if maybe dvcignore as a wrapper (blast from the past 🙂 ) would turn to be more fitting here.
Another note is that DvcignoreFilter right now is combining patter matching and filesystem operations, which has been a source of confusion, as we try to use os.path and introduce a local filesystem bias, which results in us using, say, os.path.isdir in is_ignored instead of fs.isdir
| def isdir(self): | ||
| if self._check_path_dvcignore(self.path_info): | ||
| return False | ||
| return self.fs.isdir(self.path_info) |
There was a problem hiding this comment.
dvcignore can ignore a dir (e.g. dir/) or a file, which are two different cases, so dvcignore will need to call isdir internally using os.path, but we should pass self.fs to it instead and make it use that. Or am I missing something?
There was a problem hiding this comment.
@efiop , dvcignore didn't test where the path is a file or a dir. The only difference of dirs and files are the matching of pattern like paths/.
There was a problem hiding this comment.
@karajan1001 Right, I'm saying that you are using is_ignored which will interpret path as a local path and will try to use os.path.isdir https://github.com/iterative/dvc/blob/e7571ab3251ef4417fa196e7ad45b7f6f8a42ee3/dvc/ignore.py#L354 on it, while it should use fs.isdir, because fs might be a GitFileSystem, for example.
There was a problem hiding this comment.
In general, right now dvcignore should only work for fs(and path) that only belong to dvcignore.fs(filesystem that dvcignore was collected from).
|
@karajan1001 @pared We are in a bit of a mess with legacy |
| is_ignored = self.repo.dvcignore.is_ignored_file(self.path) | ||
| return self.repo.fs.exists(self.path) and not is_ignored |
There was a problem hiding this comment.
Love how we now use is_ignored_file instead of relying on implicit is_ignored(includes is_ignored_dir which we don't care about and can be harmful) inside fs ❤️
|
|
||
|
|
||
| class AzureFileSystem(FSSpecWrapper): | ||
| class AzureFileSystem(FSSpecWrapper): # pylint:disable=abstract-method |
There was a problem hiding this comment.
There is a less dangerous way: you could add walk to FSSpecWrapper.
EDIT: on the other hand we've been using it in other FileSystems for awhile 🙁
| assert set(actual) == set(expected) | ||
| assert len(actual) == len(expected) | ||
|
|
||
| assert fs.isfile(tmp_dir / "lorem") is True | ||
| assert fs.isfile(tmp_dir / "dir" / "repo" / "foo") is False | ||
| assert fs.isdir(tmp_dir / "dir" / "repo") is False | ||
| assert fs.isdir(tmp_dir / "dir") is True | ||
|
|
||
| assert fs.isdvc(tmp_dir / "lorem") is True | ||
| assert fs.isdvc(tmp_dir / "dir" / "repo" / "dir1") is False | ||
|
|
||
| assert fs.exists(tmp_dir / "dir" / "repo.txt") is True | ||
| assert fs.exists(tmp_dir / "repo" / "ipsum") is False | ||
|
|
There was a problem hiding this comment.
Hm, this is actually what we were talking about in #5812 (comment) . Here we have a very strange situation where fs.walk is dvcignore aware, but isdir/isdfile are not, which is bad. So we need to either use dvcignore in both, or just apply dvcignore where it is needed on top of RepoFileSystem, effectively decoupling it and dvcignore.
There was a problem hiding this comment.
Simple fix for now is to just use dvicgnore in RepoFileSystems isfile/isdir/exists, but we might want to think about futher decoupling here in the future or could discuss right away. The change is not going to be simple there because RepoFileSystem was designed that way, but we could give it a shot.
| def isdir(self, path): # pylint: disable=arguments-differ | ||
| fs, dvc_fs = self._get_fs_pair(path) | ||
|
|
||
| if dvc_fs and dvc_fs.repo.dvcignore.is_ignored_dir(path): |
There was a problem hiding this comment.
ignore_subrepos is need here.
|
Had a great discussion with @karajan1001 about dvcignore and related things. To summarize the stuff to consider and handle in the followups:
Merging for now to unblock followups. Not releasing a new dvc version yet, we have some stabilization period to use. |

.gitand.dvcignore in fs❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏