
Conversation

@fsaintjacques (Contributor):

This script/library allows comparing benchmark results between revisions/builds.

@bkietz (Member) left a comment:

minor comments, looks lovely

Member:

It'd be useful to provide some progress output as each test is run so users know nothing is hung.

Maybe benchmarks could be run one at a time with messages naming each?

@fsaintjacques (Contributor, Author):

Feel free to commit, but it would require some more thinking:

  1. Rework how we capture results from google benchmark (right now they're read from stdout). We can use --benchmark_out, so that the "progress" shows up on stdout.

  2. archery's stdout is then clobbered with this result, so we'd either redirect the previous point's output to stderr, or to the logger.

I'm not very satisfied with either answer. Note that in all cases, you can get some feedback with --debug.
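For reference, a minimal sketch of point 1 using Google Benchmark's --benchmark_out flag, so the JSON results go to a file while the human-readable progress stays on stdout (the binary name is a placeholder, not an actual Arrow benchmark):

    import json
    import subprocess
    import tempfile

    with tempfile.NamedTemporaryFile(suffix=".json") as out:
        # The console progress is printed to stdout as each benchmark runs,
        # while the machine-readable results are written to --benchmark_out.
        subprocess.run(["./some-benchmark",
                        f"--benchmark_out={out.name}",
                        "--benchmark_out_format=json"],
                       check=True)
        results = json.load(open(out.name))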

Member:

One way to do it would be:

Suggested change:

    @property
    def suite_name(self):
        return os.path.splitext(os.path.basename(self.bin))[0]

    def results(self):
        argv = ["--benchmark_format=json", "--benchmark_repetitions=20"]
        results = {"benchmarks": []}
        for name in self.list_benchmarks():
            print(f"running {self.suite_name}.{name}")
            result = json.loads(self.run(*argv, f"--benchmark_filter={name}",
                                         stdout=subprocess.PIPE,
                                         stderr=subprocess.PIPE).stdout)
            results["context"] = result["context"]
            results["benchmarks"] += result["benchmarks"]
        return results

re stdout clobbering: the output already seems clobbered by things like "ninja: no work to do."

Maybe it would be better to provide the option to specify filenames for comparison (and/or benchmark) output json, rather than rely on stdio?
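A hypothetical sketch of such an option, assuming a click-based command (the command and option names are illustrative, not the actual archery interface):

    import json
    import sys

    import click


    @click.command()
    @click.option("--output", type=click.Path(dir_okay=False), default=None,
                  help="Write the comparison JSON here instead of stdout.")
    def benchmark_diff(output):
        comparison = {"benchmarks": []}  # placeholder for the real comparison
        if output:
            with open(output, "w") as fd:
                json.dump(comparison, fd, indent=2)
        else:
            json.dump(comparison, sys.stdout, indent=2)


    if __name__ == "__main__":
        benchmark_diff()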

@kszucs (Member) commented Apr 15, 2019:

@fsaintjacques is it still WIP?

@fsaintjacques marked this pull request as ready for review on April 15, 2019 14:44
@fsaintjacques changed the title from "[WIP] ARROW-4827: [C++] Implement benchmark comparison" to "ARROW-4827: [C++] Implement benchmark comparison" on Apr 15, 2019
@fsaintjacques (Contributor, Author):

@kszucs not anymore!

@kszucs (Member) commented Apr 15, 2019:

@fsaintjacques please resolve the conflict

@fsaintjacques force-pushed the ARROW-4827-benchmark-comparison branch from 24fc1dc to 512ae64 on April 15, 2019 17:03
@codecov-io commented Apr 16, 2019:

Codecov Report

Merging #4141 into master will increase coverage by 1.42%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #4141      +/-   ##
==========================================
+ Coverage   87.76%   89.18%   +1.42%     
==========================================
  Files         758      617     -141     
  Lines       92231    82202   -10029     
  Branches     1251        0    -1251     
==========================================
- Hits        80944    73310    -7634     
+ Misses      11166     8892    -2274     
+ Partials      121        0     -121
Impacted Files Coverage Δ
cpp/src/arrow/python/common.h 98.78% <0%> (-1.22%) ⬇️
python/pyarrow/parquet.py 92.34% <0%> (-1.14%) ⬇️
cpp/src/arrow/util/thread-pool-test.cc 97.66% <0%> (-0.94%) ⬇️
python/pyarrow/compat.py 90% <0%> (-0.48%) ⬇️
python/pyarrow/tests/test_table.py 99.61% <0%> (-0.39%) ⬇️
python/pyarrow/_parquet.pyx 90.5% <0%> (-0.39%) ⬇️
cpp/src/arrow/python/io.cc 95.38% <0%> (-0.38%) ⬇️
cpp/src/parquet/arrow/arrow-reader-writer-test.cc 95.31% <0%> (-0.17%) ⬇️
cpp/src/arrow/python/flight.cc 0.79% <0%> (-0.09%) ⬇️
python/pyarrow/_plasma.pyx 91.27% <0%> (-0.06%) ⬇️
... and 164 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 70813d7...2a953f1.

@fsaintjacques force-pushed the ARROW-4827-benchmark-comparison branch 6 times, most recently from 4b2f180 to d27160d on April 18, 2019 15:26
@fsaintjacques force-pushed the ARROW-4827-benchmark-comparison branch from d27160d to c371921 on April 18, 2019 18:02
@kszucs (Member) commented Apr 19, 2019:

I won't be too pedantic about this because it looks good in general, but it's hard to predict the problems that will arise without actually running and using it. I'll merge after a successful attempt to try it.
If you have any follow-up tasks, please make sure to create tickets.

@fsaintjacques (Contributor, Author):

Please be pedantic; I'm not familiar with Python's best practices. I just followed your style in ursabot/crossbow.

@pitrou self-requested a review on April 24, 2019 12:24
@pitrou (Member) left a comment:

This looks basically sound. Here are some comments; you may not necessarily want to act on all of them.

return f"BenchmarkSuite[name={name}, benchmarks={benchmarks}]"


def regress(change, threshold):
Member:

Instead of this, I would probably expect a Benchmark.does_regress(baseline) method (one that could ultimately take into account the standard deviation and the less_is_better property). Of course, that can be refactored later.
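A minimal sketch of what such a method could look like; the Benchmark attributes (mean, stddev, less_is_better) are assumptions for illustration, not the actual archery classes:

    class Benchmark:
        def __init__(self, name, mean, stddev, less_is_better=True):
            self.name = name
            self.mean = mean
            self.stddev = stddev
            self.less_is_better = less_is_better

        def does_regress(self, baseline, threshold=0.05):
            # Relative change, oriented so that positive means "worse".
            change = (self.mean - baseline.mean) / abs(baseline.mean)
            if not self.less_is_better:
                change = -change
            # Treat changes within the combined standard deviations as noise.
            noise = (self.stddev + baseline.stddev) / abs(baseline.mean)
            return change > max(threshold, noise)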

n = len(values)
mean = sum(values) / len(values)
sum_diff = sum([(val - mean)**2 for val in values])
stddev = (sum_diff / (n - 1))**0.5 if n > 1 else 0.0
Member:

By the way, since you're requiring Python 3 (I saw some f-strings), you should be aware that Python now has a simple statistics module in its standard library.

It doesn't support arbitrary quantiles, though (there's an issue open for that: https://bugs.python.org/issue35775).
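For instance, the manual computation quoted above could be written with the standard library (a sketch; the values are illustrative benchmark timings):

    import statistics

    values = [1.2, 1.3, 1.1, 1.25]
    mean = statistics.mean(values)
    # statistics.stdev uses the same (n - 1) divisor as the code above,
    # but raises on fewer than two data points, hence the guard.
    stddev = statistics.stdev(values) if len(values) > 1 else 0.0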

@fsaintjacques (Contributor, Author):

I dropped it (locally, going to update) in favor of using pandas. Do you think it's overkill to import it as a dependency? I think it's going to be useful one day or another.

Member:

Pandas sounds overkill for this, since you're dealing with arrays. NumPy would be enough.
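A sketch of the same statistics with NumPy, which also supports arbitrary quantiles (the values are illustrative):

    import numpy as np

    values = np.array([1.2, 1.3, 1.1, 1.25])  # illustrative benchmark timings
    mean = values.mean()
    stddev = values.std(ddof=1)        # sample stddev, same (n - 1) divisor
    p95 = np.percentile(values, 95)    # arbitrary quantiles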

return float(new - old) / abs(old)


DEFAULT_THRESHOLD = 0.05
Member:

What's this? Add a comment?
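A possible comment, given that this constant appears to be the relative-change cutoff used by regress() above (a sketch, not the committed code):

    # A benchmark whose relative change exceeds this ratio (5% by default)
    # is flagged as a regression.
    DEFAULT_THRESHOLD = 0.05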

@pitrou (Member) commented Apr 24, 2019:

As a side note, at some point you'll probably want to run flake8 (see the Travis lint script) on archery.

@fsaintjacques (Contributor, Author):

@pitrou updated with your comments, flake8 should pass soon.

@pitrou (Member) commented Apr 25, 2019:

Thanks. I think the CI failure is unrelated.

@pitrou (Member) left a comment:

+1. I trust that you acted on previous review comments.
