ARROW-4827: [C++] Implement benchmark comparison #4141
Conversation
bkietz left a comment:
minor comments, looks lovely
It'd be useful to provide some progress output as each test is run so users know nothing is hung.
Maybe benchmarks could be run one at a time with messages naming each?
Feel free to commit, but it would require some more thinking:

- Rework how results are captured from google benchmark (right now they are parsed from stdout). We could use `--benchmark_out`; then we'd get "progress" on stdout (a sketch follows below).
- archery's stdout is now clobbered with this result, so we'd either redirect the previous point's output to stderr, or to the logger.

I'm not very satisfied with either answer. Note that in all cases, you can get some feedback with `--debug`.
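A minimal sketch of the first option, assuming Google Benchmark's `--benchmark_out`/`--benchmark_out_format` flags (the helper name and binary handling here are illustrative, not the PR's code):

```python
import json
import subprocess
import tempfile

def run_suite(binary):
    # Write results to a temporary JSON file via --benchmark_out;
    # the console reporter keeps printing progress to stdout.
    with tempfile.TemporaryDirectory() as tmp:
        out_path = f"{tmp}/results.json"
        subprocess.run([binary,
                        "--benchmark_out=" + out_path,
                        "--benchmark_out_format=json"],
                       check=True)
        with open(out_path) as f:
            return json.load(f)
```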
One way to do it would be:
    @property
    def suite_name(self):
        return os.path.splitext(os.path.basename(self.bin))[0]

    def results(self):
        argv = ["--benchmark_format=json", "--benchmark_repetitions=20"]
        results = {"benchmarks": []}
        for name in self.list_benchmarks():
            print(f"running {self.suite_name}.{name}")
            result = json.loads(self.run(*argv, f"--benchmark_filter={name}",
                                         stdout=subprocess.PIPE,
                                         stderr=subprocess.PIPE).stdout)
            results["context"] = result["context"]
            results["benchmarks"] += result["benchmarks"]
        return results
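(`list_benchmarks` isn't shown above; a minimal sketch, assuming Google Benchmark's `--benchmark_list_tests` flag, which prints one benchmark name per line:)

```python
def list_benchmarks(self):
    # Enumerate benchmark names without running them.
    out = subprocess.run([self.bin, "--benchmark_list_tests"],
                         stdout=subprocess.PIPE, check=True)
    return out.stdout.decode().splitlines()
```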
re stdout clobbering: the output already seems clobbered by things like 'ninja: no work to do.'
Maybe it would be better to provide the option to specify filenames for comparison (and/or benchmark) output json, rather than rely on stdio?
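A sketch of that suggestion, with `out` as a hypothetical CLI option naming the output file (not an option the PR actually defines):

```python
import json

def write_comparison(comparison, out=None):
    payload = json.dumps(comparison, indent=2)
    if out is None:
        print(payload)              # current behavior: results on stdout
    else:
        with open(out, "w") as f:   # leaves stdout free for progress/logs
            f.write(payload)
```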
@fsaintjacques is it still WIP?

@kszucs not anymore!

@fsaintjacques please resolve the conflict
Codecov Report
    @@            Coverage Diff             @@
    ##           master    #4141      +/-   ##
    ==========================================
    + Coverage   87.76%   89.18%    +1.42%
    ==========================================
      Files         758      617      -141
      Lines       92231    82202    -10029
      Branches     1251        0     -1251
    ==========================================
    - Hits        80944    73310     -7634
    + Misses      11166     8892     -2274
    + Partials      121        0      -121
I won't be too pedantic about this, because it looks good in general, but it's hard to predict arising problems without actually running and using it. I'll merge after a positive attempt to try it.

Please be pedantic, I'm not familiar with Python's best practices. I just followed your style in ursabot/crossbow.
pitrou left a comment:
This looks basically sound. Here are some comments; you may not necessarily want to act on all of them.
| return f"BenchmarkSuite[name={name}, benchmarks={benchmarks}]" | ||
|
|
||
|
|
||
| def regress(change, threshold): |
Instead of this, I would probably expect a Benchmark.does_regress(baseline) method (that could ultimately take into account the standard deviation and the less_is_better property). Of course, that can be later refactored.
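A hedged sketch of the suggested shape (class layout and field names are hypothetical, not the PR's actual API):

```python
class Benchmark:
    def __init__(self, name, value, stddev, less_is_better=True):
        self.name = name
        self.value = value              # e.g. mean wall time per iteration
        self.stddev = stddev            # could later widen the threshold
        self.less_is_better = less_is_better

    def does_regress(self, baseline, threshold=0.05):
        # Relative change of this run against the baseline run.
        change = (self.value - baseline.value) / abs(baseline.value)
        if self.less_is_better:
            return change > threshold   # slower is a regression
        return change < -threshold      # lower throughput is a regression
```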
    n = len(values)
    mean = sum(values) / len(values)
    sum_diff = sum([(val - mean)**2 for val in values])
    stddev = (sum_diff / (n - 1))**0.5 if n > 1 else 0.0
btw, since you're requiring Python 3 (I saw some f-strings), you should be aware that Python now has a simple statistics module in its standard library. Though it doesn't support arbitrary quantiles (there's an open issue for that: https://bugs.python.org/issue35775).
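For example, the manual computation above reduces to this (sample values made up):

```python
import statistics

values = [1.2, 1.3, 1.1, 1.25]
mean = statistics.mean(values)
stddev = statistics.stdev(values)  # sample stddev, same (n - 1) divisor as above
```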
I dropped it (locally, going to update) in favor of using pandas; do you think it's overkill to import it as a library? I think it's going to be useful one day or another.
Pandas sounds overkill for this, since you're dealing with arrays. Numpy would be enough.
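The NumPy equivalent, which also covers the arbitrary-quantile case mentioned above (sample values made up):

```python
import numpy as np

values = np.array([1.2, 1.3, 1.1, 1.25])
mean = values.mean()
stddev = values.std(ddof=1)     # ddof=1 gives the sample standard deviation
p90 = np.quantile(values, 0.9)  # arbitrary quantiles, unlike statistics
```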
    return float(new - old) / abs(old)
    ...
    DEFAULT_THRESHOLD = 0.05
What's this? Add a comment?
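For instance, a short comment would do (wording hypothetical):

```python
# Relative changes smaller than this (5%) are treated as noise,
# not as regressions.
DEFAULT_THRESHOLD = 0.05
```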
As a side note, at some point you'll probably want to run flake8 on this.

@pitrou updated with your comments, flake8 should pass soon.

Thanks. I think the CI failure is unrelated.
pitrou left a comment:
+1. I trust that you acted on previous review comments.
This script/library allows comparing benchmark results across revisions/builds.