
benchmark for dvc status with different numbers of ignore rules #30

Merged: pared merged 9 commits into treeverse:master from karajan1001:fix_18 on Jun 8, 2020

Conversation

@karajan1001 (Contributor) commented May 31, 2020

20 ignore rules with 10k files tracked
Screenshot from 2020-05-31 10-16-55

3 ignore rules with 10k files tracked

Screenshot from 2020-05-31 10-17-02

Empty ignore rules file with 10k files tracked

Screenshot from 2020-05-31 10-17-04

1. Reduce files in data to prevent `dvc add` from causing a timeout failure.
2. Quiet mode
@karajan1001 (Contributor, Author)

dvc add 10K files takes quite a long time. If I use 40k files, the test would fail because of the asv timeout.

@skshetry (Collaborator)

dvc add 10K files takes quite a long time. If I use 40k files, the test would fail because of the asv timeout.

@karajan1001 would it be better to have `add` benchmarks with dvcignore first, rather than `status`?

@skshetry skshetry requested review from efiop and pared May 31, 2020 02:42
@skshetry skshetry linked an issue May 31, 2020 that may be closed by this pull request
@karajan1001 (Contributor, Author) commented May 31, 2020

dvc add 10K files takes quite a long time. If I use 40k files, the test would fail because of the asv timeout.

@karajan1001 would it be better to have `add` benchmarks with dvcignore first, rather than `status`?
@skshetry Thank you.

From my understanding, .dvcignore influences both `dvc add` and `dvc status`, but `dvc add` takes much longer and .dvcignore is not its bottleneck. So it is better to test .dvcignore with `dvc status` rather than `dvc add`.

@karajan1001 (Contributor, Author)

dvc add 10K files takes quite a long time. If I use 40k files, the test would fail because of the asv timeout.

This is because the test runs several times, and each time, calculating md5s and adding files to the cache took several seconds. Now I run `dvc add` before copying files to data; this change prevents adding those files to the cache.

@karajan1001 (Contributor, Author)

dvc add 10K files takes quite a long time. If I use 40k files, the test would fail because of the asv timeout.

This is because the test runs several times, and each time, calculating md5s and adding files to the cache took several seconds. Now I run `dvc add` before copying files to data; this change prevents adding those files to the cache.

After these changes, here is the result.

20 ignore rules with 10k files tracked

Screenshot from 2020-05-31 14-19-15
3 ignore rules with 10k files tracked

Screenshot from 2020-05-31 14-19-18

Empty ignore rules file with 10k files tracked

Screenshot from 2020-05-31 14-19-19

These results match
Screenshot from 2020-05-31 14-27-27

A new test with 3 ignore files of 10 rules each, compared to one file with 30 rules
@karajan1001 (Contributor, Author)

Add one more test: 1 ignore file with 30 rules, compared to 3 ignore files with 10 rules each.

1 file 30 rules
Screenshot from 2020-05-31 15-54-41

3 files, 10 rules each
Screenshot from 2020-05-31 15-54-43

@pared (Contributor) left a comment

One question

Comment thread benchmarks/status.py Outdated
@karajan1001 (Contributor, Author)

Please wait for a moment.
This benchmark runs too long and itself needs optimization.

  1. Use a module-level setup and teardown.
  2. Set repeat to 1 and increase number (one setup, then run the test multiple times). ASV

@karajan1001 (Contributor, Author) commented Jun 1, 2020

Please wait for a moment.
This benchmark runs too long and itself needs optimization.

  1. Use a module-level setup and teardown.
  2. Set repeat to 1 and increase number (one setup, then run the test multiple times). ASV

  1. The module-level setup and teardown run for every benchmark; they are combined with the class-level ones.
  2. I set repeat to 1 and number to 10, and the total running time was cut in half on my computer. But it seems there is still something wrong with it.

setup() runs twice, and time_status() runs ten times in the first run but only once in the second. I'm asking ASV for advice.
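The knobs being discussed can be sketched as asv class attributes (the class name and method bodies are illustrative, not the repo's actual benchmark):

```python
# Hypothetical sketch of the asv timing knobs discussed above. With
# repeat=1 and number=10, asv performs one (setup, measure, teardown)
# cycle and calls the timed method 10 times inside that measurement.
class DVCIgnoreBenchmark:
    repeat = 1    # how many (setup, measure) cycles asv performs
    number = 10   # how many calls to the timed method per measurement

    def setup(self):
        pass  # create the repo, files, and .dvcignore here

    def time_status(self):
        pass  # run `dvc status` here
```

This matches the observed behavior: one setup() per measurement cycle, with the timed method repeated `number` times inside it.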

@pared (Contributor) commented Jun 1, 2020

@karajan1001
asv has something called setup_cache; some context here.

The problem is that this method writes the pickled setup result to disk and loads it upon the next execution of the benchmark. In the case of DVC, that does not make sense, as we are operating on a directory/repository whose state changes during benchmark execution.

Some more context
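The mechanism being described can be sketched as follows (a minimal illustration per the asv docs; the class and its contents are invented for the example):

```python
# Sketch of asv's setup_cache mechanism: the value returned by
# setup_cache is pickled to disk once and unpickled again for every
# later benchmark process.
import pickle


class SuiteWithCache:
    def setup_cache(self):
        # Runs once per environment; must return something picklable.
        return {"num_files": 10_000}

    def track_num_files(self, cache):
        # Each benchmark receives the unpickled cache as its first arg.
        return cache["num_files"]


# Simulate asv's round trip. A DVC repo directory whose state mutates
# between runs cannot be snapshotted this way, which is why setup_cache
# does not fit this benchmark.
suite = SuiteWithCache()
cache = pickle.loads(pickle.dumps(suite.setup_cache()))
```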

@karajan1001 (Contributor, Author) commented Jun 2, 2020

@karajan1001
Asv has something called setup_cache, some context here.

The problem is that this method writes the pickled setup result to disk and loads it upon the next execution of the benchmark. In the case of DVC, that does not make sense, as we are operating on a directory/repository whose state changes during benchmark execution.

Some more context

Yes, our setup consists of paths and files, not Python in-memory objects.

Currently, the whole bench takes 10 minutes on my computer, only a quarter of the time of the first edition.

There is only one problem remaining: each setup() runs twice, and in the second pass the bench itself runs only once.

@pared (Contributor) commented Jun 2, 2020

@karajan1001
It's most likely due to the fact that we run not only the benchmarks. If you inspect run_benchmarks.dvc, you will note that we also profile the tests; that is the reason for the additional run.

@karajan1001 (Contributor, Author) commented Jun 2, 2020

@karajan1001
It's most likely due to the fact that we run not only the benchmarks. If you inspect run_benchmarks.dvc, you will note that we also profile the tests; that is the reason for the additional run.

Thank you @pared, you gave me the right answer. According to ASV:

  --profile, -p         In addition to timing, run the benchmarks through the
                        `cProfile` profiler and store the results.

And I verified that after deleting --profile, the second run of setup() disappeared. I have nothing left to optimize now.
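In miniature, the extra run works something like this (`profile_once` is a hypothetical helper illustrating the behavior, not asv's actual internals): after the timed measurements, the benchmark is executed once more under cProfile, which is why setup() appears to run a second time.

```python
# Sketch of what asv's --profile flag adds: one extra profiled
# execution of the benchmark, preceded by its own setup() call.
import cProfile
import io
import pstats


def profile_once(benchmark, setup):
    setup()  # the "extra" setup() run observed above
    profiler = cProfile.Profile()
    profiler.runcall(benchmark)  # one profiled call of the benchmark
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
    return out.getvalue()  # the stored profiling report
```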

@pared (Contributor) commented Jun 3, 2020

@karajan1001 also, I created #33. I'm not sure whether this profiling makes sense; maybe we should do it only when we notice a degradation in performance.

Comment thread benchmarks/status.py
)
assert main(["add", "data", "--quiet"]) == 0
shutil.copytree(dataset_path, "data/data")
# calculating md5
Contributor:

+1

@pared (Contributor) commented Jun 3, 2020

Now that I think about it, shouldn't we just have a single benchmark with multiple ignore rules?

I mean, having 0, 3, and 30 seems to me like creating tests to compare status performance between those 3 use cases.

The point of dvc-bench is to detect degradation in performance (and not to see how dvcignore influences the walk through a directory), and for that we need a benchmark checking status with some existing rules. And having 0 dvcignore rules looks to me like a plain status benchmark (which we don't currently have).

I would consider leaving the 30-rules case as is, removing the 3-rules case, and moving the 0-rules case to StatusBenchmark (not necessarily in this change).

What do you think @karajan1001?

@karajan1001 (Contributor, Author)

@karajan1001 also, I created #33. I'm not sure whether this profiling makes sense; maybe we should do it only when we notice a degradation in performance.

Sorry for the late reply; I agree with this.

Now that I think about it, shouldn't we just have a single benchmark with multiple ignore rules?

I mean, having 0, 3, and 30 seems to me like creating tests to compare status performance between those 3 use cases.

The point of dvc-bench is to detect degradation in performance (and not to see how dvcignore influences the walk through a directory), and for that we need a benchmark checking status with some existing rules. And having 0 dvcignore rules looks to me like a plain status benchmark (which we don't currently have).

I would consider leaving the 30-rules case as is, removing the 3-rules case, and moving the 0-rules case to StatusBenchmark (not necessarily in this change).

What do you think @karajan1001?

Yes, the zero-dvcignore case is more of a status benchmark. I changed the DVCIgnoreEmpty name several times and couldn't make a final decision.

Comparing 30 rules to 3 rules can tell us how the time cost grows with the number of dvcignore rules. I think it is OK if the time cost grows linearly with the number of rules in dvcignore; without this assumption, we could not separate the influence of preprocessing from that of regex matching. But maybe these detailed performance influences should only be checked when there is an obvious performance degradation, just like the --profile in the previous discussion. The benchmark gives the big picture; only when some degradation is detected should we go into the small pieces.

@pared (Contributor) commented Jun 5, 2020

@karajan1001

But maybe these detailed performance influences should only be checked when there is an obvious performance degradation, just like the --profile in the previous discussion. The benchmark gives the big picture; only when some degradation is detected should we go into the small pieces.

Please take note of @efiop's comment under the profile issue. We can keep the profiles in order to have analysis material if something goes south.

@karajan1001 (Contributor, Author)

But maybe these detailed performance influences should only be checked when there is an obvious performance degradation, just like the --profile in the previous discussion. The benchmark gives the big picture; only when some degradation is detected should we go into the small pieces.

@pared @efiop
And what about the 3-ignore-rules benchmark? In my opinion, with the output from --profile we already have enough information.

@pared (Contributor) commented Jun 5, 2020

@karajan1001

I agree; to see the detailed difference in the 3 vs. 30 case, we would need to compare the charts for both tests. And the main point of this repo is to fulfill #8, which would be hard to define programmatically when comparing 2 benchmarks.

So let's get rid of it for now?

@pared (Contributor) commented Jun 8, 2020

@karajan1001 great change! Thank you very much!

@pared pared merged commit 24cf942 into treeverse:master Jun 8, 2020
@karajan1001 karajan1001 deleted the fix_18 branch December 15, 2022 10:17


Development

Successfully merging this pull request may close these issues.

dvcignore/tree benchmark

3 participants