dvc: optimize Git.is_tracked()#3053
Conversation
| # There are 4 stages, see BaseIndexEntry.stage | ||
| return any((path, i) in self.repo.index.entries for i in (0, 1, 2, 3)) |
There was a problem hiding this comment.
There might be a new index version released at some point and this might break in a non-obvious way.
There was a problem hiding this comment.
Do you you have any benchmark-ish results on how much problems this is causing us right now?
There was a problem hiding this comment.
These are not index versions. From BaseIndexEntry.stage docstring:
Stage of the entry, either:
* 0 = default stage
* 1 = stage before a merge or common ancestor entry in case of a 3 way merge
* 2 = stage of entries from the 'left' side of the merge
* 3 = stage of entries from the right side of the merge
:note: For more information, see http://www.kernel.org/pub/software/scm/git/docs/git-read-tree.html
There was a problem hiding this comment.
@Suor How much real-world problems is it causing us? Does it justify this hack at all?
There was a problem hiding this comment.
Tbh, glimpsing at it I don't quite understand what is going on.
There was a problem hiding this comment.
Benched it - doesn't help much, the reason is .index is not cached it's reread on each use.
There was a problem hiding this comment.
Listing 10k files in 10k git repo took 6:19 with old code and 5:16 with new one.
There was a problem hiding this comment.
@Suor When do we do that? When we have 10K stage outputs? We don't list files inside dirs, right?
There was a problem hiding this comment.
We list when we do dvc add -r dir.
There was a problem hiding this comment.
@Suor But we don't recomend doing that on thousands of files, and consider that a misuse. Hence why I'm questioning if we even need this change.
|
Benchmarked old and new code to walk 10k files in 10k repo, with and without index entries caching:
|
|
For 100k files in repo:
All these tests are about running |
|
@Suor what is index entries caching? Don't see it mentioned/explained anywhere around here. |
|
This is how @property
def index(self):
""":return: IndexFile representing this repository's index.
:note: This property can be expensive, as the returned ``IndexFile`` will be
reinitialized. It's recommended to re-use the object."""
return IndexFile(self)Then For benching I cached it on our @cached_property
def index(self):
return self.repo.indexand then used |
|
Summary:
|
Closes #3000.