Decoupling dvcignore and fs #5812

efiop merged 42 commits into treeverse:master from karajan1001:decoupling_dvcignore_fs on May 12, 2021

@karajan1001
Contributor

@karajan1001 karajan1001 commented Apr 14, 2021

  • remove dvcignore from fs objects
  • attach dvcignore to repo
  • add ignore checks to functions (add, checkout, output, pipeline, etc.)
  • solve the subrepo problem
  • solve the problem of ignoring .git and .dvc in fs
  • pass all tests
  • final check (some new tests)
  • performance guarantee

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@karajan1001 karajan1001 marked this pull request as draft April 14, 2021 07:58
@karajan1001 karajan1001 changed the title Decoupling dvcignore and fs [WIP] Decoupling dvcignore and fs Apr 14, 2021
@efiop efiop added the refactoring Factoring and re-factoring label Apr 14, 2021
@karajan1001
Contributor Author

karajan1001 commented Apr 15, 2021

Currently affected files (those with at least one fs.walk, fs.isfile, fs.isdir, or fs.exists call in them):

dvc/checkout.py
dvc/config.py
etc

These can be classified into five levels:

  1. fs level:
dvc/fs/fsspec_wrapper.py
dvc/fs/memory.py
etc
  2. object level:
dvc/objects/db/base.py
dvc/objects/db/local.py
etc
  3. output level:
dvc/output/base.py
dvc/dependency # inherits from output
etc
  4. command level:
dvc/checkout.py
dvc/repo/add.py
etc
  5. file operations:
dvc/config.py
dvc/dvcfile.py
etc

Originally, the dvcignore check was done at the fs level and affected all of the levels above it. In this PR, we have two choices for where to do the check:

  1. At the output level.
  2. At the command level.

Currently, DVC behaves more like the output-level option. Ignore checking affects not only untracked files but also those that have already been added to the repo. For example:

$ dvc add a
$ echo a >> .dvcignore
$ dvc status

would throw an exception

ERROR: Path '<dvc_root>/a' is ignored by
.dvcignore:5:a

And in the documentation:

⚠️ dvc run and dvc repro might remove ignored files. If they are not produced by a pipeline stage, they can be lost permanently.

Keep in mind that when you add .dvcignore patterns that affect an existing output, its status will change and DVC will behave as if that affected files were deleted.

For comparison, in Git, .gitignore only affects untracked files. Its behavior is more like a command-level check (this is actually what dulwich does); that implementation is simpler and its behavior is easier to predict.

Should we keep our current logic or switch to a more Git-like one?

BTW, even if we decided to switch, it shouldn't be included in this PR.
@efiop @dberenbaum
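
To make the two options concrete, here is a minimal sketch (not DVC code) that contrasts them. The names `is_ignored`, `status_output_level`, and `add_command_level` are hypothetical, and `fnmatch` is only a stand-in for real .dvcignore pattern matching.

```python
from fnmatch import fnmatch


class IgnoredPathError(Exception):
    """Raised when a tracked path matches an ignore pattern."""


def is_ignored(path, patterns):
    # Simplified matching: real .dvcignore rules follow gitignore syntax.
    return any(fnmatch(path, pattern) for pattern in patterns)


def status_output_level(tracked, patterns):
    # Output-level check: ignore rules also apply to already tracked paths,
    # so a tracked path that becomes ignored is reported as an error.
    for path in tracked:
        if is_ignored(path, patterns):
            raise IgnoredPathError(f"Path '{path}' is ignored")
    return sorted(tracked)


def add_command_level(candidates, tracked, patterns):
    # Command-level (Git-like) check: ignore rules only filter untracked
    # candidates; paths that are already tracked are left alone.
    new = [
        path
        for path in candidates
        if path not in tracked and not is_ignored(path, patterns)
    ]
    return sorted(set(tracked) | set(new))
```

With `tracked = {"a"}` and the pattern `a`, the first function raises, which mirrors the `dvc status` error above, while the second keeps `a` tracked and only filters new candidates.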

@karajan1001
Contributor Author

@efiop
Besides the functional problem above, we have two more code problems:

  1. dvcignore used to couple with sub-repo and they need to be decoupled now.
  2. Without dvcignore, .git, .dvc directory are now exposed to the filesystem. We need a new way to mask them.

@efiop
Contributor

efiop commented Apr 15, 2021

@karajan1001 Great summary! Indeed, there is an inconsistency where dvcignore could ignore already tracked output, even though it really should behave like git add -f for a .gitignored file. I think it is more of an accident caused by the current fs & dvcignore coupling and indeed could be changed. I'm not sure there are even tests that check for that behavior right now.

dvcignore used to couple with sub-repo and they need to be decoupled now.

Most of user-facing commands don't really walk into subrepos when walking the host repo, but when given a path that is in subrepo - will indeed go into the subrepo (it is a bit confusing, but it is really a more simple logic), while we do have some tests that check for recursive host walk that has to walk into subrepos as well. The latter one could be considered to be changed.

Without dvcignore, .git, .dvc directory are now exposed to the filesystem. We need a new way to mask them.

Hopefully most of such cases could be covered by just filtering out fs.walk() results with .dvcignore, so it won't traverse into .git and .dvc when it is not needed.
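
As an illustration of filtering walk results, the sketch below wraps a top-down walk and prunes internal directories so the traversal never descends into them. It uses `os.walk` rather than DVC's own `fs.walk()`, and `filtered_walk`, `HIDDEN_DIRS`, and `demo` are hypothetical names chosen for this example.

```python
import os
import tempfile

# Hypothetical set of internal directories to hide from traversal.
HIDDEN_DIRS = frozenset({".git", ".dvc"})


def filtered_walk(walk_iter, hidden=HIDDEN_DIRS):
    """Yield (root, dirs, files) from a top-down walk, skipping hidden dirs."""
    for root, dirs, files in walk_iter:
        # Pruning in place stops a top-down os.walk from descending further.
        dirs[:] = [d for d in dirs if d not in hidden]
        yield root, dirs, files


def demo():
    # Build a throwaway tree with .git, .dvc and one regular directory.
    with tempfile.TemporaryDirectory() as top:
        for sub in (".git", ".dvc", "data"):
            os.makedirs(os.path.join(top, sub))
        open(os.path.join(top, ".git", "HEAD"), "w").close()
        open(os.path.join(top, "data", "foo"), "w").close()
        visited = [
            os.path.relpath(root, top)
            for root, _dirs, _files in filtered_walk(os.walk(top))
        ]
    return sorted(visited)
```

Because the pruning happens before the underlying generator resumes, the contents of `.git` and `.dvc` are never listed at all, which is what keeps the cost low.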

@skshetry
Collaborator

@efiop, @karajan1001, what are the reasons for the decoupling? IIRC we depend on this behaviour in a lot of places
(stage collection, parametrization, checkouts, add/repro, get/import), so I see this as a necessity in a lot of places.

(maybe what we need is a better API but is decoupling/duplication worth it?).

@efiop
Contributor

efiop commented Apr 15, 2021

@skshetry Bugs like #5605 and migration to fsspec. Those could be solved in the current arch, but it gets clunky, so trying to explore if more explicit approach of applying dvcignore only where it is needed (similar to git) would fit better. We might decide that it is not worth it too.

@karajan1001
Contributor Author

@karajan1001 Great summary! Indeed, there is an inconsistency where dvcignore could ignore already tracked output, even though it really should behave like git add -f for a .gitignored file. I think it is more of an accident caused by the current fs & dvcignore coupling and indeed could be changed. I'm not sure there are even tests that check for that behavior right now.

So, it is not a designed behavior but accidentally behaves like this.

dvcignore used to couple with sub-repo and they need to be decoupled now.

Most of user-facing commands don't really walk into subrepos when walking the host repo, but when given a path that is in subrepo - will indeed go into the subrepo (it is a bit confusing, but it is really a more simple logic), while we do have some tests that check for recursive host walk that has to walk into subrepos as well. The latter one could be considered to be changed.

"The latter one could be considered to be changed." Sorry, I didn't understand this.

@efiop
Contributor

efiop commented Apr 15, 2021

@karajan1001 I mean that we can change the second part: behaviour that is only tested by us in synthetic tests but not actually used in any user-facing functionality. E.g. the ability to fs.walk("host_root") and go through the host repo and into the subrepo could be considered for dropping, as it is not used by any dvc commands.

Have to point out that this is an open-ended task, if you see that the current approach works better - we can keep it.

Comment thread dvc/fs/local.py Outdated
Comment thread dvc/fs/local.py Outdated
@karajan1001
Contributor Author

Current subrepo problem.

Originally, a subrepo's dvcignore was attached to its own fs, so each fs had its own dvcignore patterns and could be controlled individually. Now the subrepo's fs is called directly by the root repo's fs, and our dvcignore patterns are bound to the subrepo ignore. We can't get a subrepo's files while ignoring the subrepo's patterns.

Two solutions:

  1. The root repo doesn't get the subrepo's files directly.
  2. dvcignore decouples its pattern ignore from its subrepo ignore.
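
A rough sketch of what the second solution could look like: pattern rules and subrepo rules are kept as separate checks, and the caller decides whether subrepos count as ignored. The class `IgnoreFilter`, its methods, and the semantics of the `ignore_subrepos` flag are assumptions made for this illustration; `fnmatch` stands in for real .dvcignore matching.

```python
from fnmatch import fnmatch


class IgnoreFilter:
    """Toy filter that keeps pattern rules and subrepo rules separate."""

    def __init__(self, patterns, subrepo_dirs):
        self.patterns = list(patterns)
        self.subrepo_dirs = set(subrepo_dirs)

    def matches_pattern(self, path):
        # Simplified stand-in for .dvcignore pattern matching.
        return any(fnmatch(path, pattern) for pattern in self.patterns)

    def is_subrepo(self, path):
        return path in self.subrepo_dirs

    def is_ignored_dir(self, path, ignore_subrepos=True):
        # Pattern rules always apply; the subrepo rule can be switched off.
        if self.matches_pattern(path):
            return True
        return ignore_subrepos and self.is_subrepo(path)
```

With the two checks separated, a caller that wants to walk into a subrepo passes `ignore_subrepos=False` and still gets pattern-based filtering.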

@karajan1001 karajan1001 force-pushed the decoupling_dvcignore_fs branch 2 times, most recently from ea8262d to 6770760 Compare April 27, 2021 06:26
@karajan1001 karajan1001 self-assigned this Apr 28, 2021
@karajan1001 karajan1001 changed the title [WIP] Decoupling dvcignore and fs Decoupling dvcignore and fs Apr 28, 2021
@karajan1001 karajan1001 marked this pull request as ready for review April 28, 2021 07:35
@pared pared self-requested a review April 28, 2021 10:14
@karajan1001
Contributor Author

karajan1001 commented Apr 30, 2021

| tests | previous version | after this PR | diff |
|---|---|---|---|
| add.Add.time_cats_dogs copy | 38.6±0.7s | 40.0±0.6s | 3.6% |
| add.Add.time_cats_dogs symlink | 30.9±0.1s | 30.1±0.6s | -2.6% |
| add.Add.time_cats_dogs hardlink | 38.1±0.1s | 37.0±0.2s | -2.9% |
| checkout.CheckoutBench.time_cats_dogs copy | 16.8±0.4s | 16.8±0.3s | 0.0% |
| checkout.CheckoutBench.time_cats_dogs symlink | 8.27±0.4s | 8.45±0.04s | 2.1% |
| checkout.CheckoutBench.time_cats_dogs hardlink | 13.4±0.4s | 13.7±0.1s | 2.2% |
| collect.CollectBench.time_stages_collection | 2.58±0s | 2.18±0s | -15.5% |
| collect.TraverseGitRepoBench.time_repo_traversing | 4.14±0s | 5.28±0s | 27.1% |
| imports.ImportBench.time_imports | 13.9±0m | 14.1±0m | 1.4% |
| init.InitNoScmBench.time_init | 177±5ms | 177±2ms | 0.0% |
| init.InitScmBench.time_init | 197±3ms | 196±1ms | -0.5% |
| push.PushBench.time_cats_dogs | failed | failed | |
| startup.StartupBench.time_startup | 107±0ms | 110±0ms | 2.8% |
| status.DVCIgnoreBench.time_status | 403±0ms | 406±0ms | 0.7% |
| status.DVCStatusBench.time_status | 401±0ms | 401±0ms | 0.0% |

Most of the tests' performance changed by less than 4% after this PR. Only two of them show an obvious difference:
collect.CollectBench.time_stages_collection speeds up by about 15%
collect.TraverseGitRepoBench.time_repo_traversing slows down by about 27%
I'm currently looking for the reason.
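
For reading the diff column, the sketch below assumes it is the relative change of the mean timing, (after - before) / before. That formula is an assumption, not something stated in the thread, and `relative_change` is a hypothetical helper. Since the table shows rounded means, not every row is guaranteed to reproduce exactly.

```python
def relative_change(before, after):
    """Return the change from `before` to `after` as a percentage."""
    return (after - before) / before * 100


# Mean timings for collect.CollectBench.time_stages_collection, in seconds.
change = relative_change(2.58, 2.18)
print(f"{change:+.1f}%")  # a negative value means the benchmark got faster
```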

@karajan1001
Contributor Author

karajan1001 commented Apr 30, 2021

The extra time cost comes from two steps:

  1. We now update two trees.
  2. Searching all dirs (potential subrepos) takes time.

After solving these two problems, the new version runs a bit faster than before. This might be because we moved subrepo finding from __call__ to __update__, which runs only once.

| tests | previous version | after this PR | after optimization | diff |
|---|---|---|---|---|
| collect.CollectBench.time_stages_collection | 2.67±0s | 2.33±0s | 2.27±0s | -15.0% |
| collect.TraverseGitRepoBench.time_repo_traversing | 4.14±0s | 5.37±0s | 3.96±0s | -4.4% |
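
The idea of moving an expensive lookup out of the per-call path can be sketched as follows. This is not the code from the PR: `SubrepoAwareFilter`, `update`, and the `find_subrepos` callable are hypothetical names, and the real filter does considerably more.

```python
class SubrepoAwareFilter:
    """Toy filter that looks up subrepos once instead of on every call."""

    def __init__(self, find_subrepos):
        # `find_subrepos` is a hypothetical callable that scans for subrepos.
        self._find_subrepos = find_subrepos
        self._subrepos = None

    def update(self):
        # The expensive scan runs here and its result is cached.
        self._subrepos = frozenset(self._find_subrepos())

    def __call__(self, dirs):
        if self._subrepos is None:
            self.update()
        # Per-call work is now just a set lookup.
        return [d for d in dirs if d not in self._subrepos]
```

Calling the filter many times during a traversal then triggers only a single scan, which is consistent with the explanation given for the improved timings.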

Comment thread tests/func/test_ignore.py Outdated
Comment thread tests/func/test_ignore.py Outdated
Comment thread tests/func/test_ignore.py Outdated
Comment thread tests/func/test_ignore.py Outdated
Comment thread tests/func/test_ignore.py Outdated
Comment thread tests/func/test_ignore.py Outdated
@karajan1001
Contributor Author

karajan1001 commented May 3, 2021

Newest results, a bit better than before.

| tests | previous version | after this PR | diff |
|---|---|---|---|
| add.Add.time_cats_dogs copy | 40.4±0.7s | 36.4±0.5s | -9% |
| add.Add.time_cats_dogs symlink | 30.8±0.08s | 28.8±0.01s | -6.5% |
| add.Add.time_cats_dogs hardlink | 36.5±0.2s | 35.0±0.04s | -4.2% |
| checkout.CheckoutBench.time_cats_dogs copy | 15.6±0.2s | 15.4±0.02s | -1.1% |
| checkout.CheckoutBench.time_cats_dogs symlink | 7.69±0.02s | 7.59±0.01s | -1.4% |
| checkout.CheckoutBench.time_cats_dogs hardlink | 12.4±0.05s | 12.5±0.02s | 0.8% |
| collect.CollectBench.time_stages_collection | 2.49±0s | 2.15±0s | -13.7% |
| collect.TraverseGitRepoBench.time_repo_traversing | 4.09±0s | 3.87±0s | -5.4% |
| imports.ImportBench.time_imports | 14.6±0m | 14.8±0m | 1.3% |
| init.InitNoScmBench.time_init | 175±0.9ms | 175±0.7ms | 0.0% |
| init.InitScmBench.time_init | 193±0.7ms | 194±0.7ms | 0.5% |
| push.PushBench.time_cats_dogs | failed | failed | |
| startup.StartupBench.time_startup | 104±0ms | 104±0ms | 0.0% |
| status.DVCIgnoreBench.time_status | 403±0ms | 401±0ms | -0.5% |
| status.DVCStatusBench.time_status | 402±0ms | 398±0ms | -1.0% |

@karajan1001 karajan1001 force-pushed the decoupling_dvcignore_fs branch from 7758d1c to 798abbf Compare May 3, 2021 08:08
@karajan1001 karajan1001 requested review from efiop and pared May 3, 2021 09:15
Comment thread dvc/checkout.py Outdated
Comment on lines 180 to 184
Contributor

@efiop efiop May 3, 2021

walk_files should probably be files to avoid confusion with the function.

How about:

Suggested change
-    if fs.scheme == Schemes.LOCAL and fs.repo:
-        walk_files = fs.repo.dvcignore(fs.walk(path_info), walk_files=True)
-    else:
-        walk_files = fs.walk_files(path_info)
-    existing_files = set(walk_files)
+    files = set(cache.repo.dvcignore(fs, path_info, walk_files=True))

unlike fs.repo, cache.repo is more persistent (though temporarily) so we can rely on it more.

Or maybe even:

    files = set(cache.repo.dvcignore.walk_files(fs, path_info))

so that dvcignore has walk and walk_files methods.
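
A minimal sketch of an ignore object with both methods, where `walk_files` is built on top of `walk`. It assumes the methods receive a walk iterator of (root, dirs, files) triples rather than an fs and a path; `WalkFilter` and its name-only `fnmatch` matching are hypothetical simplifications, not DVC's implementation.

```python
import posixpath
from fnmatch import fnmatch


class WalkFilter:
    """Toy ignore filter exposing walk() and walk_files()."""

    def __init__(self, patterns):
        self.patterns = list(patterns)

    def _ignored(self, name):
        # Simplified: match on the base name only.
        return any(fnmatch(name, pattern) for pattern in self.patterns)

    def walk(self, walk_iter):
        # Filter each (root, dirs, files) triple from any top-down walker.
        for root, dirs, files in walk_iter:
            dirs[:] = [d for d in dirs if not self._ignored(d)]
            files = [f for f in files if not self._ignored(f)]
            yield root, dirs, files

    def walk_files(self, walk_iter):
        # Flatten the filtered walk into file paths.
        for root, _dirs, files in self.walk(walk_iter):
            for name in files:
                yield posixpath.join(root, name)
```

Keeping `walk_files` as a thin layer over `walk` means the filtering rules live in one place.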

Also, for the record, I'm refactoring this right now to decouple repo from fs, and fs.repo will go away soon, as well as cache.repo. We'll likely start passing dvcignore (as well as state) to checkout as kwargs. No action needed in this PR, just FYI.

Contributor Author

@karajan1001 karajan1001 May 4, 2021

dvc/objects/stage.py:        walk_iterator = odb.repo.dvcignore.walk_files(fs.walk(path_info))
dvc/utils/fs.py:            walk_iterator = dvcignore.walk_files(fs.walk(path))
dvc/checkout.py:            cache.repo.dvcignore.walk_files(fs.walk(path_info))
dvc/fs/repo.py:        yield from repo.dvcignore.walk(
dvc/repo/collect.py:            target_infos.extend(repo.dvcignore.walk_files(fs.walk(path_info)))
dvc/repo/add.py:                repo.dvcignore.walk_files(repo.fs.walk(target)),
dvc/repo/stage.py:        for root, dirs, files in self.repo.dvcignore.walk(

Currently dvcignore is used in these 7 places.

In 4 of them, it can be derived directly from repo:

dvc/repo/collect.py:            target_infos.extend(repo.dvcignore.walk_files(fs.walk(path_info)))
dvc/repo/add.py:                repo.dvcignore.walk_files(repo.fs.walk(target)),
dvc/repo/stage.py:        for root, dirs, files in self.repo.dvcignore.walk(
dvc/fs/repo.py:        yield from repo.dvcignore.walk(

In 2 of them, it comes from cache (a temporary solution):

dvc/objects/stage.py:        walk_iterator = odb.repo.dvcignore.walk_files(fs.walk(path_info))
dvc/checkout.py:            cache.repo.dvcignore.walk_files(fs.walk(path_info))

For the last one, I added a repo variable to the State class:

dvc/utils/fs.py:            walk_iterator = dvcignore.walk_files(fs.walk(path))

Contributor Author

In my opinion, maybe we should pass a list of individual files instead of paths, filesystems, and other things in the last three cases.

Contributor

In my opinion, maybe we should pass a list of individual files instead of paths, filesystems, and other things in the last three cases.

Could you elaborate, please?

Comment thread dvc/fs/repo.py Outdated
Comment thread dvc/fs/repo.py Outdated
@karajan1001
Contributor Author

Any methods to suppress this other than using # pylint?

I guess we need to override it and raise NotImplementedError

I remember that in some other languages an abstract method must be implemented (can't stay abstract) in subclasses.

Hmm, that didn't work in pylint either.

Comment thread dvc/ignore.py Outdated
Comment thread dvc/fs/git.py Outdated
Comment thread dvc/repo/fetch.py
jobs=jobs,
follow_subrepos=False,
)
obj = stage(odb, path_info, repo.repo_fs, "md5", jobs=jobs,)
Contributor

There's an interesting question of whether we should apply dvcignore here instead of relying on it being applied in repo_fs 🤔 For now we could keep it as is, but it might be a good idea for the future.

Contributor Author

@karajan1001 karajan1001 May 11, 2021

@efiop Adding dvcignore here would cause dvc fetch to download only the caches that are not ignored. The problem would be with caches inside subrepos. Subrepo problems are not thoroughly solved yet. Maybe it is better to solve this after we have fixed and clarified the subrepo documentation and usage.

Comment thread dvc/output/base.py Outdated
Comment thread dvc/output/base.py Outdated
Comment thread dvc/repo/__init__.py Outdated
Comment thread dvc/utils/fs.py
def get_mtime_and_size(path, fs, dvcignore=None):
    import nanotime

    if fs.isdir(path):
Contributor

This should probably also use dvcignore, right?

Contributor Author

Yes. Without it, DVC-tracked directories would be regarded as modified after .dvcignore changes.

Contributor

So should it check that the path is not dvcignored first?

On a related topic, right now we pass dvcignore to the state constructor so we could pass it through to get_mtime_and_size, which has an implication of us always using dvcignore even though we might not want to. So maybe it could be a good idea to pass dvcignore explicitly to State methods that might need it.

State is a bit behind right now on fsspec changes and still uses os.path instead of fs (same as get_mtime_and_size), but that should change in the near future. Thinking if maybe dvcignore as a wrapper (blast from the past 🙂 ) would turn out to be more fitting here.

Another note is that DvcignoreFilter right now combines pattern matching and filesystem operations, which has been a source of confusion, as we try to use os.path and introduce a local filesystem bias, which results in us using, say, os.path.isdir in is_ignored instead of fs.isdir.

Comment thread dvc/output/base.py
Comment on lines 278 to 281
def isdir(self):
    if self._check_path_dvcignore(self.path_info):
        return False
    return self.fs.isdir(self.path_info)
Contributor

dvcignore can ignore a dir (e.g. dir/) or a file, which are two different cases, so dvcignore will need to call isdir internally using os.path, but we should pass self.fs to it instead and make it use that. Or am I missing something?

Contributor Author

@efiop, dvcignore doesn't test whether the path is a file or a dir. The only difference between dirs and files is in matching patterns like paths/.

Contributor

@karajan1001 Right, I'm saying that you are using is_ignored which will interpret path as a local path and will try to use os.path.isdir https://github.com/iterative/dvc/blob/e7571ab3251ef4417fa196e7ad45b7f6f8a42ee3/dvc/ignore.py#L354 on it, while it should use fs.isdir, because fs might be a GitFileSystem, for example.

Contributor

In general, right now dvcignore should only work for an fs (and paths) that belong to dvcignore.fs (the filesystem that dvcignore was collected from).
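
The point about asking the right filesystem can be shown with a small sketch. Here the ignore check receives the filesystem object and calls its `isdir`, so a path that exists only in a non-local filesystem is still classified correctly. `MemoryFS`, `matches`, and this `is_ignored` signature are hypothetical and much simpler than DVC's real matching.

```python
class MemoryFS:
    """Tiny in-memory filesystem used only for this illustration."""

    def __init__(self, dirs, files):
        self._dirs = set(dirs)
        self._files = set(files)

    def isdir(self, path):
        return path in self._dirs

    def isfile(self, path):
        return path in self._files


def matches(path, pattern, is_dir):
    # Simplified rule: a pattern ending in "/" only matches directories.
    if pattern.endswith("/"):
        return is_dir and path == pattern.rstrip("/")
    return path == pattern


def is_ignored(path, patterns, fs):
    # Ask the given filesystem, not os.path, whether the path is a directory.
    is_dir = fs.isdir(path)
    return any(matches(path, pattern, is_dir) for pattern in patterns)
```

With `os.path.isdir`, a directory that lives only inside a Git revision would look like a non-directory, and a `dir/` pattern would silently stop matching it.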

@efiop
Contributor

efiop commented May 10, 2021

@karajan1001 @pared We are in a bit of a mess with legacy walk() and fsspec migration, the end goal is fsspec's walk(). For now just adding walk to fsspec_wrapper and adding a pylint ignore to that line is fine, we'll adjust it in a followup.
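
The alternative of adding walk to the wrapper can be sketched as below. pylint's abstract-method check treats a method that only raises NotImplementedError as abstract, so implementing it once in an intermediate class removes the need for a per-subclass disable comment. The class names and the `inner` attribute are hypothetical and do not reflect DVC's actual class layout.

```python
class BaseFileSystem:
    def walk(self, top):
        # pylint treats a method that raises NotImplementedError as abstract.
        raise NotImplementedError


class Wrapper(BaseFileSystem):
    """Hypothetical wrapper around an object that provides walk()."""

    def __init__(self, inner):
        self.inner = inner

    def walk(self, top):
        # Implementing walk once here gives subclasses a concrete method.
        yield from self.inner.walk(top)


class ConcreteFS(Wrapper):
    """Subclass that needs neither an override nor a pylint disable."""
```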

@karajan1001 karajan1001 requested a review from efiop May 11, 2021 06:45
Comment thread dvc/output/base.py
Comment thread dvc/output/base.py Outdated
Comment thread dvc/dvcfile.py
Comment on lines +127 to +128
is_ignored = self.repo.dvcignore.is_ignored_file(self.path)
return self.repo.fs.exists(self.path) and not is_ignored
Contributor

Love how we now use is_ignored_file instead of relying on the implicit is_ignored (which includes is_ignored_dir, which we don't care about and which can be harmful) inside fs ❤️

Comment thread dvc/fs/azure.py


-class AzureFileSystem(FSSpecWrapper):
+class AzureFileSystem(FSSpecWrapper):  # pylint:disable=abstract-method
Contributor

@efiop efiop May 11, 2021

There is a less dangerous way: you could add walk to FSSpecWrapper.

EDIT: on the other hand we've been using it in other FileSystems for awhile 🙁

Comment on lines 467 to -478
assert set(actual) == set(expected)
assert len(actual) == len(expected)

assert fs.isfile(tmp_dir / "lorem") is True
assert fs.isfile(tmp_dir / "dir" / "repo" / "foo") is False
assert fs.isdir(tmp_dir / "dir" / "repo") is False
assert fs.isdir(tmp_dir / "dir") is True

assert fs.isdvc(tmp_dir / "lorem") is True
assert fs.isdvc(tmp_dir / "dir" / "repo" / "dir1") is False

assert fs.exists(tmp_dir / "dir" / "repo.txt") is True
assert fs.exists(tmp_dir / "repo" / "ipsum") is False

Contributor

@efiop efiop May 11, 2021

Hm, this is actually what we were talking about in #5812 (comment). Here we have a very strange situation where fs.walk is dvcignore aware, but isdir/isfile are not, which is bad. So we need to either use dvcignore in both, or just apply dvcignore where it is needed on top of RepoFileSystem, effectively decoupling it and dvcignore.

Contributor

@efiop efiop May 11, 2021

A simple fix for now is to just use dvcignore in RepoFileSystem's isfile/isdir/exists, but we might want to think about further decoupling here in the future, or we could discuss it right away. The change is not going to be simple there because RepoFileSystem was designed that way, but we could give it a shot.
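
One way to picture the consistency requirement is a thin wrapper that applies the same ignore predicate to every query method. This is only a sketch under assumed interfaces: `IgnoreAwareFS` is a hypothetical name, `fs` is any object with exists/isfile/isdir, and `is_ignored` is a plain path predicate rather than DVC's dvcignore object.

```python
class IgnoreAwareFS:
    """Toy wrapper that applies one ignore check to every query method."""

    def __init__(self, fs, is_ignored):
        # `fs` needs exists/isfile/isdir; `is_ignored` is a path predicate.
        self.fs = fs
        self.is_ignored = is_ignored

    def exists(self, path):
        return not self.is_ignored(path) and self.fs.exists(path)

    def isfile(self, path):
        return not self.is_ignored(path) and self.fs.isfile(path)

    def isdir(self, path):
        return not self.is_ignored(path) and self.fs.isdir(path)
```

Because all three methods go through the same predicate, they cannot disagree with each other about an ignored path, which is the situation the assertions in this thread were testing for.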

Comment thread dvc/fs/repo.py
def isdir(self, path): # pylint: disable=arguments-differ
fs, dvc_fs = self._get_fs_pair(path)

if dvc_fs and dvc_fs.repo.dvcignore.is_ignored_dir(path):
Contributor Author

ignore_subrepos is needed here.

@efiop
Contributor

efiop commented May 12, 2021

Had a great discussion with @karajan1001 about dvcignore and related things. To summarize the stuff to consider and handle in the followups:

  1. Potential RepoFileSystem and dvcignore decoupling

  2. Scan for potential existing tests that were relying on not using dvcignore (esp regarding subrepos). Those tests are letting us down because of bad previous architecture 🙁 This needs to be solved before the next release.

  3. We should make State use fs instead of os.path and possibly stop passing dvcignore to it into constructor or into separate methods (by possibly wrapping the filesystem with dvcignore before passing the fs to the methods)

  4. dvcignore can be used even more granularly in things like stages and outputs (and others). E.g. only when dvc-adding, so we don't waste dvcignore calls for no reason.

  5. We should stop using abspath in dvcignore. This is related to fsspec migration and making our filesystems only work with fspath-s

  6. Stop using PathInfo in fs/dvcignore related applications. Again related to fsspec migration and making our filesystems only work with string fspath-s

Merging for now to unblock followups. Not releasing a new dvc version yet, we have some stabilization period to use.

@efiop efiop merged commit b349b39 into treeverse:master May 12, 2021
@karajan1001 karajan1001 deleted the decoupling_dvcignore_fs branch May 12, 2021 02:29