Skip to content

Performance improvement of dvcignore#3967

Merged
pared merged 11 commits into
treeverse:masterfrom
karajan1001:fix3869
Jun 15, 2020
Merged

Performance improvement of dvcignore#3967
pared merged 11 commits into
treeverse:masterfrom
karajan1001:fix3869

Conversation

@karajan1001
Copy link
Copy Markdown
Contributor

@karajan1001 karajan1001 commented Jun 7, 2020

Fix #3869

  • ❗ I have followed the Contributing to DVC checklist.

  • 📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here. If the CLI API is changed, I have updated tab completion scripts.

  • ❌ I will check DeepSource, CodeClimate, and other sanity checks below. (We consider them recommendatory and don't expect everything to be addressed. Please fix things that actually improve code or fix bugs.)

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

karajan1001 and others added 7 commits May 13, 2020 21:59
1. add new command `dvc remote rename <old> <new>`.
2. add tests for this new command.
Co-authored-by: Peter Rowlands (변기호) <peter@pmrowla.com>
fix#3869
1.Use big regex.
@efiop efiop requested a review from pared June 8, 2020 08:46
Copy link
Copy Markdown
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I would not close #3869 just yet.
@karajan1001 you mentioned the flashtext library, and I could not find the description of the experiment conducted, that gave the author conclusion that it makes sense to use trie-based approach only above some number of keywords. What about matching few gitignore patterns against hundreds of files? Maybe there could be gain too.

Also, I wonder whether re.compile has limit of pattern size. Checked against 100k, seems to not be a problem. I did not find resources pointing to the limit.

@pared
Copy link
Copy Markdown
Contributor

pared commented Jun 8, 2020

@karajan1001 Windows build is failing.

Comment thread dvc/ignore.py Outdated
@karajan1001
Copy link
Copy Markdown
Contributor Author

karajan1001 commented Jun 10, 2020

LGTM, though I would not close #3869 just yet.
@karajan1001 you mentioned the flashtext library, and I could not find the description of the experiment conducted, that gave the author conclusion that it makes sense to use trie-based approach only above some number of keywords. What about matching few gitignore patterns against hundreds of files? Maybe there could be gain too.

Also, I wonder whether re.compile has limit of pattern size. Checked against 100k, seems to not be a problem. I did not find resources pointing to the limit.

Maybe this is because regex is written in C, and flashtext is in Python. flashtext has a lower level O() but a much bigger constant term. Use a C version of Aho-Corasick Automata (variation of trie-tree) would beat regex in any cases.

@karajan1001
Copy link
Copy Markdown
Contributor Author

karajan1001 commented Jun 13, 2020

@pared @Suor
Because there is still something wrong with my ASV. I can only give a command-line result
Screenshot from 2020-06-13 20-40-30
This is the current version.

[ 81.82%] · For dvc commit fc42ca72 <1.0.0a1>:
[ 90.34%] ··· status.DVCIgnoreBench.time_status                                                                                                                          501±3ms
[ 90.91%] ··· status.DVCStatusBench.time_status                                                                                                                         84.3±2ms

This is the result after this submit.

[ 90.91%] · For dvc commit a86cedbf <1.0.0a1test>:
[ 99.43%] ··· status.DVCIgnoreBench.time_status                                                                                                                          104±1ms
[100.00%] ··· status.DVCStatusBench.time_status                                                                                                                       79.8±0.5ms

The new version took 24ms, and the old one took 417ms. (17.35 times faster)

And In the worst case where the rules in the order that ignore rules and include rules are separated from each other. And there is no big regex exists(Hardly could this happened in real).
The result is:
Screenshot from 2020-06-13 20-50-08

The new version took 84ms, and the old one took 416ms. (4.95 times faster)

@Suor Suor self-requested a review June 15, 2020 08:19
Copy link
Copy Markdown
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one looks good. We should merge it.

The thing is bothering me is that this doesn't help with many ignore files. Can we make a bench for that? How many ignore files do we need to be in trouble?

@pared pared merged commit a59f90f into treeverse:master Jun 15, 2020
@pared
Copy link
Copy Markdown
Contributor

pared commented Jun 15, 2020

@karajan1001, thanks for this PR!

Now, that implementation of dvcignore changed, @Suor point makes sense. I did not foresee that in treeverse/dvc-bench#30
@karajan1001, as you are the original author of dvcingore benchmarks, would you be interested in reintroducing multiple dvcignore files benchmark? We could pull it to the extreme and do 30 files x 1 rule, instead of original 3 x 10.

@pared pared mentioned this pull request Jun 15, 2020
1 task
@karajan1001 karajan1001 deleted the fix3869 branch April 7, 2021 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

optimize dvcignore

3 participants