
Conversation

@fsaintjacques (Contributor) commented May 9, 2019

The goal of this change is to mark benchmarks as candidates for automated regression checks. Some benchmarks were refactored for various reasons:

  • Code deduplication where needed.
  • Uniform input size for benchmarks in the same suite, usually to minimize the effect of the cache hierarchy.
  • Favor external repetitions over manual repetitions and MinTime when possible (see the sketch after this list).
  • Add cmake ARROW_BUILD_BENCHMARKS_REFERENCE to toggle reference benchmarks.
  • Remove the default benchmark filter of ^Regression.
  • Remove the BM_ prefix from benchmark names.
  • Fixes and additions to archery: support for the --pdb debugging option and for benchmarks reported as items_per_second.
  • Add an archery benchmark list sub-command to list suites and benchmarks.
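
On the repetitions point above, here is a minimal Google Benchmark sketch (the benchmark name and sizes are hypothetical, not code from this PR): the body does no manual repetition and the registration sets no MinTime, so repetition counts stay under the harness's control, and it reports items_per_second via SetItemsProcessed so archery can pick that metric up.

    #include <cstdint>
    #include <vector>
    #include "benchmark/benchmark.h"

    // Hypothetical benchmark. No manual repetition loop and no ->MinTime()
    // in the registration: repetitions are left to the harness, e.g.
    // --benchmark_repetitions=20 on the command line.
    static void SumInt64(benchmark::State& state) {
      const int64_t size = state.range(0);
      std::vector<int64_t> values(size, 1);
      for (auto _ : state) {
        int64_t sum = 0;
        for (int64_t v : values) sum += v;
        benchmark::DoNotOptimize(sum);
      }
      // Reported by Google Benchmark as items_per_second.
      state.SetItemsProcessed(state.iterations() * size);
    }

    BENCHMARK(SumInt64)->Arg(1 << 20);
    BENCHMARK_MAIN();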

@wesm (Member) commented May 9, 2019

I'm not sure why the "Regression" designation is necessary. It seems like we would want to monitor the performance of all of our benchmarks for significant changes.

@fsaintjacques (Contributor, Author)

Multiple reasons:

  • some benchmarks are flaky.
  • some benchmarks exist only as points of reference for other benchmarks (anything with Naive/Dummy in the name) or for hardware discoverability, e.g. the memory latency and bandwidth benchmarks.
  • some benchmarks are heavyweight.

I think it's preferable to whitelist rather than include everything and possibly increase false-positive regression alerts. It also lowers the time to run benchmarks, since archery explicitly passes (by default) the --benchmark_filter="^Regression" option.
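
For illustration (hypothetical benchmark names, not the actual Arrow ones): Google Benchmark names each benchmark after its registered function, so anchoring the filter regex at ^Regression selects only the opted-in benchmarks.

    #include "benchmark/benchmark.h"

    // Opted into monitoring: its name matches --benchmark_filter="^Regression".
    static void RegressionBuildInt64Array(benchmark::State& state) {
      for (auto _ : state) {
        // ... build an Int64 array ...
      }
    }
    BENCHMARK(RegressionBuildInt64Array);

    // Point-of-reference benchmark: the name does not match, so it is skipped
    // when archery passes the default filter.
    static void NaiveBuildInt64Array(benchmark::State& state) {
      for (auto _ : state) {
        // ... naive implementation ...
      }
    }
    BENCHMARK(NaiveBuildInt64Array);

    BENCHMARK_MAIN();

    // ./some-benchmark --benchmark_filter="^Regression" runs only the first one.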

@fsaintjacques (Contributor, Author)

By the way, I have no attachment to Regression; it can be something else, but the function name (and hence the benchmark name) paired with a regex is the easiest way to label and filter.

@wesm (Member) commented May 9, 2019

If it's an issue with reporting on flaky benchmarks, we could create a "blacklist file" to instruct the reporter not to report regressions in known flakes.

@nealrichardson (Member)

Some thoughts, for what they're worth:

  • A benefit of blacklisting vs. whitelisting (whatever the mechanism for indicating it) is that the default is to include benchmarks in regression testing, and I think that's the right default--if you're adding a benchmark, most likely the purpose is to make sure performance doesn't go down in the future. The bias should be in favor of this, and make developers have to opt out by blacklisting.
  • I'm generally of the opinion that flaky tests have negative value--you don't notice them when they pass, and when they fail, you don't believe them--and should either be fixed immediately or deleted (or skipped, if you're into that). My instinct is that the same is true for flaky benchmarks.
  • Maybe heavy benchmarks belong in a different place altogether since we aren't going to run them routinely.

@fsaintjacques (Contributor, Author)

Ok, I'll just remove the filtering by default. Should we strip the BM_ prefix out of every benchmark name? I don't see the value in keeping it.

@wesm (Member) commented May 15, 2019

It's been several days and I'm still having a hard time swallowing the "Regression*" naming convention.

fsaintjacques force-pushed the ARROW-5269-regression-candidates branch 2 times, most recently from fe050bb to 0592ecb on May 21, 2019 15:49
@fsaintjacques (Contributor, Author)

@ursabot benchmark

@ursabot commented May 21, 2019

I've successfully started builds for this PR

@ursabot commented May 21, 2019

AMD64 Ubuntu 18.04 C++ Benchmark

Build failed.

@codecov-io commented May 21, 2019

Codecov Report

Merging #4285 into master will increase coverage by 0.98%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master    #4285      +/-   ##
==========================================
+ Coverage    88.3%   89.28%   +0.98%     
==========================================
  Files         780      636     -144     
  Lines       98400    87138   -11262     
  Branches     1251        0    -1251     
==========================================
- Hits        86891    77801    -9090     
+ Misses      11273     9337    -1936     
+ Partials      236        0     -236
Impacted Files Coverage Δ
cpp/src/plasma/thirdparty/ae/ae.c 70.75% <0%> (-0.95%) ⬇️
cpp/src/plasma/test/external_store_tests.cc 100% <0%> (ø) ⬆️
cpp/src/plasma/test/client_tests.cc 100% <0%> (ø) ⬆️
cpp/src/plasma/test/serialization_tests.cc 100% <0%> (ø) ⬆️
go/arrow/ipc/writer.go
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/ipc/file_reader.go
js/src/enum.ts
go/arrow/array/builder.go
... and 144 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68329e7...27633c9. Read the comment docs.

@fsaintjacques (Contributor, Author)

@ursabot benchmark

@fsaintjacques (Contributor, Author) commented May 22, 2019

@pitrou note that I added the cmake configuration flag ARROW_BUILD_BENCHMARKS_REFERENCE, which toggles a C/C++ define ARROW_WITH_BENCHMARKS_REFERENCE. The goal is to avoid compiling (and running) benchmarks which are not used for monitoring purposes but only as points of reference, e.g. memory latency/bandwidth, naive implementations of algorithms...

Thus if you add a new benchmark and you know that it's only used as a point of reference for the true benchmark, you should probably wrap it in said #ifdef, as sketched below.
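
For example (a hypothetical reference benchmark, not code from this PR), such a benchmark would be guarded so it only exists when the define is set:

    #include <cstdint>
    #include <cstring>
    #include <vector>
    #include "benchmark/benchmark.h"

    #ifdef ARROW_WITH_BENCHMARKS_REFERENCE

    // Point-of-reference benchmark: raw memcpy bandwidth as a hardware
    // baseline, not something to monitor for regressions.
    static void ReferenceMemoryBandwidth(benchmark::State& state) {
      const size_t size = 16 * 1024 * 1024;
      std::vector<uint8_t> src(size, 0xFF), dst(size);
      for (auto _ : state) {
        std::memcpy(dst.data(), src.data(), size);
        benchmark::DoNotOptimize(dst.data());
      }
      state.SetBytesProcessed(state.iterations() * static_cast<int64_t>(size));
    }
    BENCHMARK(ReferenceMemoryBandwidth);

    #endif  // ARROW_WITH_BENCHMARKS_REFERENCE

With ARROW_BUILD_BENCHMARKS_REFERENCE=OFF the define is absent, so the benchmark is neither compiled nor run.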

@ursabot commented May 22, 2019

I've successfully started builds for this PR

@ursabot commented May 22, 2019

AMD64 Ubuntu 18.04 C++ Benchmark

Build failed with an exception.

Member

Really just a nit, but ARROW_WITH_REFERENCE_BENCHMARKS might be a better choice.

Member

Well, there's ARROW_BUILD_BENCHMARKS already. The convention seems to be ARROW_BUILD_xxx to enable the xxx target, and ARROW_WITH_yyy to enable support for the yyy external library.

@fsaintjacques (Contributor, Author) May 27, 2019

Are you suggesting the s/BUILD/WITH/ change, or swapping BENCHMARKS/REFERENCE?

@kszucs (Member) left a comment

LGTM, just a couple nits

@kszucs (Member) commented May 24, 2019

@ursabot benchmark

@ursabot commented May 24, 2019

I've successfully started builds for this PR

@kszucs (Member) commented May 24, 2019

Why did we get empty output in https://ci.ursalabs.org/#builders/73/builds/20?

@fsaintjacques (Contributor, Author) commented May 24, 2019

@kszucs because benchmarks were renamed, and thus the intersection between the two runs is empty. See https://github.com/apache/arrow/blob/master/dev/archery/archery/benchmark/compare.py#L115-L116

@ursabot commented May 24, 2019

AMD64 Ubuntu 18.04 C++ Benchmark

Build failed with an exception.

@pitrou (Member) left a comment

Thank you very much for doing this. Here are some comments and questions.

- Ensure that Builder benchmarks are working on inputs of the same
  (approximately) size in bytes. This allows relative comparison between
  builders.
- Renamed benchmarks by prefixing `Regression`.
- Fixed extra string copy in BuildStringDictionary.
- Use the same buffer size in all bitmap benchmarks.
- Fix some reporting numbers
- Add '\n' to diff json output adhering to jsonlines
- Add support for items_per_second metrics
- Add `--pdb` option to drop a pdb shell on uncaught exception
- Use a single data generator
- Fix multithread input size
- Exclude multithread from regressions
- Favor external repetitions over manual repetitions and mintime when
possible.
- Add cmake ARROW_BUILD_BENCHMARKS_REFERENCE to toggle reference benchmarks.
- Remove default benchmark filter of `^Regression`.
- Remove Regression prefix from benchmark
fsaintjacques force-pushed the ARROW-5269-regression-candidates branch from b8c4c41 to a9ecffd on May 28, 2019 15:27
@fsaintjacques (Contributor, Author)

@pitrou @kszucs I addressed the comments.

@pitrou (Member) left a comment

Thanks @fsaintjacques. Just a couple more things and it'll be good IMO :-)

@pitrou (Member) left a comment

+1 :-)
