optimize local status #3867
Merged
Will look if I can add a benchmark like that to dvc-bench before this is merged.
efiop
commented
May 24, 2020
Also, we've discussed this before, but none of these methods take dvcignore into account, which is wrong.
30a474c to
d81b17a
Compare
Thank you @efiop ! Also, I noticed that the more I have entries in my |
@courentin Thanks for the feedback! Yeah, we are definitely not fully done optimizing it, but this seems like a good start 😉
efiop
commented
May 24, 2020
added 5 commits
May 25, 2020 01:22
We've done a lot of optimizations lately, which made the unpacked dir trick obsolete.
It is ~5.5 times slower than joining by hand.
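As a rough illustration of what "joining by hand" could look like, here is a minimal sketch. The helper names `join_by_hand` and `time_joins` are hypothetical and not taken from the DVC code base; the sketch assumes `root` has no trailing separator and `name` is a plain entry name, which is what directory walking normally yields. The actual speedup will vary by machine and Python version.

```python
import os
import timeit


def join_by_hand(root: str, name: str) -> str:
    # Hypothetical helper: plain concatenation instead of os.path.join.
    # Only valid when `root` has no trailing separator and `name` is a
    # plain entry name (not absolute, no leading separator).
    return root + os.sep + name


def time_joins(number: int = 100_000) -> tuple:
    # Returns (seconds for os.path.join, seconds for join_by_hand).
    root, name = os.path.join("data", "images"), "0001.png"
    slow = timeit.timeit(lambda: os.path.join(root, name), number=number)
    fast = timeit.timeit(lambda: join_by_hand(root, name), number=number)
    return slow, fast


if __name__ == "__main__":
    slow, fast = time_joins()
    print(f"os.path.join: {slow:.3f}s, by hand: {fast:.3f}s")
```

The shortcut skips the input checks that `os.path.join` performs, so it is only safe where the path shapes are known in advance.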
On a repo with a dvcignore with 1 pattern and a directory with 400K files, `dvc status` now takes ~8 sec instead of ~30 sec. To achieve that, we make some assumptions about the path formats that we are dealing with, so we can use simpler logic instead of calling the very slow `relpath`, `abspath`, etc. on every entry in a directory. It is also clear that CleanTree behavior is inconsistent (even tests expect very different outputs from it), so we will need to look into this later.
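The kind of simplification described in this commit message can be sketched as follows. This is not the code from the PR; `rel_to_root` is a hypothetical helper, and it assumes that both paths are already absolute and normalized and that `path` lies inside `root`. Under those assumptions, slicing off the prefix gives the same result as `os.path.relpath` without the per-call normalization work.

```python
import os


def rel_to_root(path: str, root: str) -> str:
    # Hypothetical stand-in for os.path.relpath(path, root).
    # Assumes: both paths are absolute and normalized, and `path` is
    # strictly inside `root`. Anything else is rejected, not resolved.
    prefix = root if root.endswith(os.sep) else root + os.sep
    if not path.startswith(prefix):
        raise ValueError(f"{path!r} is not inside {root!r}")
    return path[len(prefix):]


if __name__ == "__main__":
    root = os.path.abspath("repo")
    path = os.path.join(root, "data", "file.txt")
    print(rel_to_root(path, root))
```

The trade-off is that paths containing `..` or pointing outside the root are no longer handled; the sketch raises instead of resolving them.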
efiop
commented
May 25, 2020
),
("dont_ignore.txt", ["dont_ignore"], False),
("dont_ignore.txt", ["dont*", "!dont_ignore.txt"], False),
("../../../something.txt", ["**/something.txt"], False),
It is too expensive for match to try to resolve paths like this.
Last `dvc status` results in ~50 sec (way over 1 minute if there is no unpacked dir) on current master and ~8 sec with this PR.

The main issue that was causing this is `relpath`, which we were calling for each file that we check for a match in dvcignore. Plus some along-the-way optimizations like `os.path.join` and saving `stat`s on `is_protected`. Also removing the `unpacked dir` optimization, as it is no longer needed, mainly because of the recent optimizations.

The issue was reported by a user (https://discordapp.com/channels/485586884165107732/563406153334128681/713338211128311808), and my reproduction script is based on the dataset structure that he has.
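One way to read "saving `stat`s" is to avoid a separate existence check before calling `os.stat`, so each file costs a single system call. The sketch below is an assumption about the general technique, not the code from this PR: `is_protected_sketch` and the `CACHE_MODE` value of `0o444` are hypothetical names and values chosen for illustration.

```python
import os
import stat

# Hypothetical read-only mode used to mark a file as protected.
CACHE_MODE = 0o444


def is_protected_sketch(path: str) -> bool:
    # One os.stat call per file: a missing file is handled through the
    # exception rather than a separate os.path.exists() check.
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return False
    return stat.S_IMODE(mode) == CACHE_MODE
```

On a directory with hundreds of thousands of files, dropping one system call per entry adds up, which is why this kind of change can matter even though each call is cheap.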
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.
❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)